Queue timing
This weekend Delicon and I were on a quest to finalize the queue functionality, mainly to minimize the latency of processing queue operations.
There are three discrete queue functions.
READ
WRITE
SEEK
Also, there are two groups of queues. FAST queues run out of the ARM's internal RAM. There are only 16 of these, taking 4K of the 8K of ARM RAM (256 bytes per queue). Then we have SRAM queues, which can use up to 64K of space. To limit the hotspot footprint, we currently plan on two "banks" of 100 hotspots representing the queues, so they would not use up all of the 64K of memory. The way that works is still subject to change.
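For concreteness, here's a minimal behavioral sketch of a single queue in C, assuming each operation also advances an internal index. All the names and the wrap-around behavior are my own illustration, not the actual firmware - on the real hardware this work is split between the CPLD and the ARM:

```c
#include <stdint.h>

/* Assumed per-queue size: 4K of ARM RAM / 16 fast queues = 256 bytes. */
#define QUEUE_SIZE 256

typedef struct {
    uint8_t data[QUEUE_SIZE];
    uint16_t index;   /* position of the next entry to be accessed */
} Queue;

/* READ: hand back the byte at the current index, then step forward. */
uint8_t queue_read(Queue *q) {
    uint8_t v = q->data[q->index];
    q->index = (q->index + 1) % QUEUE_SIZE;
    return v;
}

/* WRITE: store a byte at the current index, then step forward. */
void queue_write(Queue *q, uint8_t v) {
    q->data[q->index] = v;
    q->index = (q->index + 1) % QUEUE_SIZE;
}

/* SEEK: jump the index to an arbitrary position. */
void queue_seek(Queue *q, uint16_t pos) {
    q->index = pos % QUEUE_SIZE;
}
```

The point of the model is that every READ and WRITE implies an index update, and that update is exactly the work the ARM has to squeeze in between accesses.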
The ideal is for a queue read to be over and done with in the course of a single LDA $XXXX instruction. The reason this is important is in a kernel, sometimes you have to do two successive writes like this:
LDA QUEUE0 ; 4 cycles - first queue read
LDX QUEUE1 ; 4 cycles - second queue read
STA GRP0   ; 3 cycles
STX GRP1   ; 3 cycles
So the queue indexing function must complete processing before the second load begins - a window of only four CPU cycles (about 3.4 microseconds at the 6507's 1.19 MHz clock).
It's a tall order to have the ARM multitask the queue indexing in the background. Even though the next value won't be needed until the next scanline, the ARM must index forward on the queue that was just read before the read of the next queue starts - it simply can't schedule the indexing for later. It's now or never. And in a really busy kernel there won't be much downtime for the ARM to catch up anyway, so deferring would lead to unpredictable results. Kernel timing is make or break for what the 2600 can do graphically, so it's important not to crimp anyone's style.
As currently designed, reads work the same way as writes. The ARM won't know whether a write took place or not. It will read the "cached" value out of the CPLD and write it back to the queue stack. So if it's just a read, it writes the same value back. This adds extra overhead to the process.
Writes are not as timing-critical because they generally happen outside of the kernel and are done via the SC writing method, so there is plenty of time for the indexing to take place before the queue can be read or written again. The same goes for seeks, because you seek by writing.
With our current design, if you load/store load/store, the fast queues work fine, because each intervening store buys the indexing three extra CPU cycles before the next queue read. But you can't load/load store/store - that appears to be over the limit by about 1 CPU cycle.
Also, with SRAM queues, the timing is much worse because the SRAM is hanging off the CPLDs so the ARM has to talk through the CPLD to get at it.
I was able to get the existing demo from last week to run with SRAM queues, but in a real-world scenario, you would not want mandatory NOPs in the middle of your kernel.
So one thing we're going to do is implement a write-protect toggle for the queues. By telling the ARM whether or not writes are enabled, we let it switch to a fast read-only mode for queues when responding to CPLD interrupts.
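A sketch of how the toggle could work, under my assumption that it simply gates the write-back (the flag and all names here are hypothetical):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical illustration of the proposed write-protect toggle.
 * When writes are disabled, servicing a queue access skips the
 * write-back and only advances the index - the fast read-only mode. */
typedef struct {
    uint8_t storage[256];
    uint16_t index;
    uint8_t cached;        /* value latched by the CPLD */
    bool writes_enabled;   /* the proposed write-protect toggle */
} QueueServiceWP;

void service_access_wp(QueueServiceWP *q) {
    if (q->writes_enabled)
        q->storage[q->index] = q->cached; /* slow path: full write-back */
    q->index = (q->index + 1) % 256;      /* fast path: index only */
}
```

Skipping the store is what would claw back the cycle or so that load/load store/store currently overshoots by.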
Another problem currently hurting us is the cramped board layout which Delicon says is creating a lot of noise in the signal. This noise is actually forcing the ARM to wait until it subsides when responding to certain signals.
So the plan is to go through another board revision where the traces are better separated to avoid the noise. This will be easier to do because the new board will feature a microSD slot instead of MMC, so there will be more real estate to work with.
So with the combination of the write-protect and the elimination of the noise, Delicon pledges that fast queues will work for load/load store/store. SRAM queues are to be determined. The 16 fast queues are probably enough by themselves for most games, so if SRAM queues only work with 3 cycles between reads, it will probably be okay. But I'm hoping it doesn't come to that.
BTW, for those of you less technically inclined who are wondering what the heck this is all about, don't worry too much. When the hardware is finalized, we'll put a lot of effort into explaining how this works in terms that are easier to understand. It's just that right now we're still pushing the boundaries of the hardware to see what we're going to be able to support. When decisions have to be made, we're trying to settle on a design that will feel relatively straightforward. We're really trying to avoid forcing the programmer to jump through too many hoops or dodge too many pitfalls to use these features. If it starts going in that direction, we're doing something wrong and either have to find another way or drop the feature entirely.