
Queue timing


mos6507


This weekend Delicon and I were on a quest to finalize the queue functionality, mainly to minimize the latency in processing queue functions.

 

There are three discrete queue functions.

 

READ

WRITE

SEEK

 

Also, there are two groups of queues. FAST queues run out of the ARM's internal RAM; there are only 16 of these, and together they take up 4K of the ARM's 8K of RAM. Then we have SRAM queues, which can use up to 64K of space. To limit the hotspot footprint, we currently plan on two "banks" of 100 hotspots representing the queues, so they would not use up all of the 64K of memory. The way that works is still subject to change.
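As a back-of-the-envelope sketch, the memory budget works out like this. The 16-queue, 4K/8K, 64K, and 2 x 100 hotspot figures come from the numbers above; the 256-byte per-queue size is my own inference (16 x 256 = 4K), not a confirmed design parameter:

```c
/* Rough memory budget for the two queue groups. Only FAST_QUEUE_BYTES
 * is assumed; the other figures are stated in the post. */
enum {
    ARM_RAM_BYTES     = 8 * 1024,                       /* ARM internal RAM  */
    FAST_QUEUES       = 16,
    FAST_QUEUE_BYTES  = 256,                            /* assumed per-queue */
    FAST_POOL_BYTES   = FAST_QUEUES * FAST_QUEUE_BYTES, /* 4K of the 8K      */
    SRAM_BYTES        = 64 * 1024,                      /* SRAM queue space  */
    HOTSPOT_BANKS     = 2,
    HOTSPOTS_PER_BANK = 100
};
```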

 

The ideal is for a queue read to be over and done with in the course of a single LDA $XXXX instruction. The reason this is important is that in a kernel, you sometimes have to do two successive reads followed by two successive writes, like this:

 

LDA QUEUE0

LDX QUEUE1

STA GRP0

STX GRP1

 

So the queue indexing function must complete processing by the time the 2nd load command starts up.

 

It's a tall order to have the ARM multitask the queue indexing function in the background. Even though the next value won't be needed until the next scanline, the ARM must index forward on the queue that was just read before the read of a different queue starts, because it simply can't schedule the indexing for later. It's now or never. In a really busy kernel there won't be much downtime for the ARM to catch up anyway, which would lead to unpredictable results. Kernel timing is make-or-break for what the 2600 can do graphically, so it's important not to crimp anyone's style.
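A toy model of the "now or never" indexing, assuming a hypothetical interrupt handler and per-queue CPLD output latch (all names invented for illustration; this is not the real firmware). When the CPLD flags that queue n was just read, the ARM must advance that queue's pointer and stage the next byte before the 6507 can start the next queue load:

```c
#include <stdint.h>

typedef struct {
    const uint8_t *data;   /* backing store (ARM RAM or SRAM) */
    uint16_t       index;  /* element most recently served    */
    uint16_t       length;
} Queue;

static uint8_t cpld_latch[16]; /* hypothetical per-queue output latches */

/* Called in response to the CPLD's read interrupt for queue n. */
void on_queue_read(Queue *q, int n)
{
    q->index = (uint16_t)((q->index + 1) % q->length); /* index forward now */
    cpld_latch[n] = q->data[q->index];                 /* stage next value  */
}
```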

 

As currently designed, reads work the same way as writes. The ARM doesn't know whether a write took place or not, so it reads the "cached" value out of the CPLD and writes it back to the queue stack. If the access was just a read, it ends up writing the same value back. This adds extra overhead to the process.
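In the same toy-model style (invented names, not the real firmware), the symmetric read/write path looks something like this. Because the ARM can't tell whether the 6507 read or wrote the hotspot, it always copies the CPLD's cached value back into the queue slot: redundant after a plain read, necessary after a write, an extra store either way:

```c
#include <stdint.h>

typedef struct {
    uint8_t  *data;
    uint16_t  index;
    uint16_t  length;
} Queue;

static uint8_t cpld_cached[16]; /* value the CPLD last saw on the bus */
static uint8_t cpld_latch[16];  /* next value to serve                */

/* Handle any queue access; reads and writes are indistinguishable. */
void on_queue_access(Queue *q, int n)
{
    q->data[q->index] = cpld_cached[n];  /* write-back, even for reads */
    q->index = (uint16_t)((q->index + 1) % q->length);
    cpld_latch[n] = q->data[q->index];   /* stage the next byte        */
}
```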

 

Writes are not as timing-critical because they generally happen outside of the kernel via the Supercharger-style writing method, so there is plenty of time for the indexing to take place before the queue can be read or written again. The same goes for seeks, because you perform a seek with a write.

 

With our current design, if you alternate load/store, load/store, the fast queues work fine. But you can't do load/load, store/store; it appears to be over the limit by about one CPU cycle.
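Some rough cycle math shows why the back-to-back pattern is so much tighter. These are the standard 6507 instruction timings (LDA/LDX/STA with absolute addressing each take 4 cycles); the "gap" framing is my own:

```c
/* Start-to-start gap between two queue reads, which is the window the
 * ARM gets to finish indexing before the next read begins. */
enum {
    CYC_LDA_ABS = 4,
    CYC_STA_ABS = 4,
    GAP_LOAD_STORE_LOAD = CYC_LDA_ABS + CYC_STA_ABS, /* store between reads */
    GAP_LOAD_LOAD       = CYC_LDA_ABS                /* back-to-back reads  */
};
```

At the 2600's roughly 1.19 MHz clock a CPU cycle is about 840 ns, so going from load/store/load to load/load cuts the ARM's indexing window roughly in half.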

 

Also, with SRAM queues the timing is much worse, because the SRAM hangs off the CPLDs, so the ARM has to talk through a CPLD to get at it.

 

I was able to get the existing demo from last week to run with SRAM queues, but in a real-world scenario, you would not want mandatory NOPs in the middle of your kernel.

 

So one thing we're going to do is implement a write-protect toggle for the queues. By being told whether or not writes are enabled, the ARM can switch to a fast read-only mode for queues when responding to CPLD interrupts.
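Continuing the earlier toy model (invented names, assumptions only), the write-protect toggle would let the ARM skip the write-back entirely on read-only queues and take a shorter path through the interrupt:

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint8_t  *data;
    uint16_t  index;
    uint16_t  length;
} Queue;

static uint8_t cpld_cached[16];
static uint8_t cpld_latch[16];
static bool    write_enabled[16]; /* the hypothetical write-protect toggle */

void on_queue_access(Queue *q, int n)
{
    if (write_enabled[n])
        q->data[q->index] = cpld_cached[n]; /* only pay for real writes */
    q->index = (uint16_t)((q->index + 1) % q->length);
    cpld_latch[n] = q->data[q->index];      /* stage the next byte      */
}
```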

 

Another problem currently hurting us is the cramped board layout, which Delicon says is creating a lot of signal noise. This noise forces the ARM to wait until it subsides when responding to certain signals.

 

So the plan is to go through another board revision where the traces are better separated to avoid the noise. This will be easier to do because the new board will feature a microSD slot instead of MMC, so there will be more real estate to work with.

 

With the combination of the write-protect and the elimination of the noise, Delicon pledges that fast queues will work for load/load, store/store. SRAM queues are to be determined. The 16 fast queues by themselves are probably enough for most games, so if SRAM queues will only work with 3 cycles between reads, that will probably be okay. But I'm hoping it doesn't come to that.

 

BTW, for those of you less technically inclined who are wondering what the heck this is all about, don't worry too much. When the hardware is finalized, we'll put a lot of effort into explaining how this works in terms that are easier to understand. Right now we're still pushing the boundaries of the hardware to see what we're going to be able to support. When decisions have to be made, we're trying to settle on a design that will feel relatively straightforward. We're really trying to avoid forcing the programmer to jump through too many hoops and dodge too many pitfalls to use these features. If it starts going in that direction, we're doing something wrong and either have to find another way or drop the feature entirely.

2 Comments


Recommended Comments

When the code does a read from a queue, what's the exact sequence of events? Since I haven't seen any indication that you use a last-read strategy (where each queue read will return the results of the last queue fetch and then fetch a value for the next one), I would think that if you can complete a queue read within 800ns (one cycle) you should have no problem getting the indexing done before the next read (which will start at least three cycles after the end of the first one).
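The last-read strategy described above could be modeled in plain C like this (names are mine, not the cartridge's API): each access returns the byte fetched on the previous access and only then fetches the next one, so the slow fetch overlaps the cycles between queue reads.

```c
#include <stdint.h>

typedef struct {
    const uint8_t *data;
    uint16_t       index;   /* next element to fetch               */
    uint16_t       length;
    uint8_t        staged;  /* value the NEXT access will return   */
} Queue;

void queue_init(Queue *q, const uint8_t *data, uint16_t length)
{
    q->data   = data;
    q->length = length;
    q->staged = data[0]; /* prime the pipeline with the first byte */
    q->index  = 1;
}

uint8_t queue_read(Queue *q)
{
    uint8_t out = q->staged;       /* hand back the last fetch...  */
    q->staged = q->data[q->index]; /* ...then fetch the next one   */
    q->index  = (uint16_t)((q->index + 1) % q->length);
    return out;
}
```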

 

Also, in continuing a thought from another thread, would there be any possibility of putting magic hotspots at $0040-$007F? If the two cycles following that hotspot come from cartridge memory, output the last requested byte on the data bus during the last half of the second such cycle. If timing allows, the byte fetched from memory in the first half of that cycle could be used to determine the next queue operation.

 

You might need the CPLD to assist with some of the timing, but I would think things should be manageable. The normal minimum time between queue accesses would be five cycles--one more than the time between two back-to-back loads from the queue--but the code would get a free TIA store in that time.

 

Also, is there any reason not to support magic writes? They're much more convenient than Supercharger-style writes from a 6507 programming perspective, and I'd expect them to be more convenient from the cartridge-implementation standpoint as well.


I would think that if you can complete a queue read within 800ns (one cycle) you should have no problem getting the indexing done before the next read (which will start at least three cycles after the end of the first one).

It's tough to get things done in time. I'm limited by bus width (the SRAM is 128K, so 17-bit addresses), the ARM's response time to interrupts, and noise. I will make the timing requirements; I just need to work through some of this and eliminate some board noise. I switched from an oscillator to a crystal (trying to save a couple of bucks) and unfortunately picked up a lot of noise in the process. I have talked over some strategies with an expert, but there are no concrete solutions; this is a kind of black magic. Many times problems are eliminated through trial and error.

 

Also, is there any reason not to support magic writes? They're much more convenient than Supercharger-style writes from a 6507 programming perspective, and I'd expect them to be more convenient from the cartridge-implementation standpoint as well.

We have the components on the board to support magic writes; we just haven't had time to test that feature out. Right now the resistors are being used as terminators on the VCS data lines, which suppresses almost all of the noise leaked to the VCS (without the terminators it screws up the picture badly).
