
Chimera Queues - Brainstorming discussion


mos6507


Some possibilities:

Have a hotspot to clear a queue. I.e., write the queue-number to hotspot CLEAR and that queue is zeroed out.

Even better: be able to fill a queue with a value other than zero (write the queue number to the hotspot first then write the value).

Even better: be able to fill *regions* of a queue with a certain value - i.e., write $9F to the first 100 spots in queue #2.

 

Holy grail: what TJ requested: filling part of a queue with values (plural) from ROM. So you have a sprite graphic in ROM from $F000 to $F00F and you fill spots 32-47 in queue #3 with those graphics.

All of these are possible. I guess I don't understand why you only want to use a single queue for everything. Use 50 queues for 50 sprites, then you don't ever have to erase data and write it back in. I don't see filling queues at power on as an issue. But if that's the case then you can just have them loaded for you by the ARM from any file, including your source.

 

I don't have a problem with any of these suggestions, I am just trying to get a better understanding. I am mostly ignorant of VCS programming, so please be patient.

 

Vern


Another problem would be that you need at least one queue for each sprite (multiplied by the number of animation frames). So if a sprite is 10 pixels tall on average, your approach wastes 246 bytes (or 96%) of space. So 128K SRAM shrinks to 5.2K if you only use it for sprites.

But you're not actually using code space for queue storage. You can't put code in the queue space, so you aren't wasting space. You have 64K for your immediate code use, and infinite amounts of code can be swapped in over that 64K. The other 64K is off limits anyway. That's only for queue storage.

 

Vern


I echo TJ: there has to be a way to populate the queues *very* quickly. Automatically would be best.

 

Some possibilities:

Have a hotspot to clear a queue. I.e., write the queue-number to hotspot CLEAR and that queue is zeroed out.

Even better: be able to fill a queue with a value other than zero (write the queue number to the hotspot first then write the value).

Even better: be able to fill *regions* of a queue with a certain value - i.e., write $9F to the first 100 spots in queue #2.

 

Holy grail: what TJ requested: filling part of a queue with values (plural) from ROM. So you have a sprite graphic in ROM from $F000 to $F00F and you fill spots 32-47 in queue #3 with those graphics. Fill automatically, that is, by writing, say, the initial queue position, the final queue position, and the address of where to pull the data to fill those positions.

So, using above example:

write 32

write 47

write $F000

Done!

 

Doable?

 

What you are describing is a block move command similar to how we're planning to do disk IO. I think this may work well although the limiting factor will still always be SRAM access speed.
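
Roughly, the ARM side could decode that write sequence something like this. This is only a pseudo-C sketch to make the data flow concrete; the names are invented, the queue number is assumed to come from which hotspot was hit, and a 16-bit source address like $F000 would really take two writes over the 8-bit bus:

#include <stdint.h>

#define QUEUE_LEN 256

static uint8_t queues[256][QUEUE_LEN];      /* stand-in for the 256 SRAM queues        */
static uint8_t rom[0x1000];                 /* stand-in for the 6507's 4K ROM window   */
static uint8_t rom_read(uint16_t a) { return rom[a & 0x0FFF]; }

/* Called for each write to the (hypothetical) FILL hotspot of a queue. */
void fill_hotspot_write(uint8_t queue_id, uint8_t value)
{
    static uint8_t state = 0, start, end;
    static uint16_t src;

    switch (state) {
    case 0: start = value; state = 1; break;        /* 1st write: first queue position  */
    case 1: end   = value; state = 2; break;        /* 2nd write: last queue position   */
    case 2: src   = value; state = 3; break;        /* 3rd write: source address, low   */
    case 3:                                         /* 4th write: source address, high  */
        src |= (uint16_t)value << 8;
        for (int i = start; i <= end; i++)          /* the actual block fill from ROM   */
            queues[queue_id][i] = rom_read(src + (uint16_t)(i - start));
        state = 0;
        break;
    }
}

Written that way, the 6507 pays for a few cheap writes and the ARM does the copying, which is exactly the appeal of a block-move command.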


I like the built-in flickersort ideas, and also I'd suggest a single command that would allow left and/or right bitwise rotation of several queues at once (and into one another.) This would make things like left/right scrolling work very well, whether we wanted to scroll a large playfield, a 48-pixel sprite or 96-pixel bitmap. It wouldn't bother me if this kept the ARM busy for a few scanlines.

 

Aside from that, the most important feature for me would be good documentation for those who want to write their own ARM code, and of course an easy way to upload this code to the cart.


I like the built-in flickersort ideas, and also I'd suggest a single command that would allow left and/or right bitwise rotation of several queues at once (and into one another.) This would make things like left/right scrolling work very well, whether we wanted to scroll a large playfield, a 48-pixel sprite or 96-pixel bitmap. It wouldn't bother me if this kept the ARM busy for a few scanlines.

 

Aside from that, the most important feature for me would be good documentation for those who want to write their own ARM code, and of course an easy way to upload this code to the cart.

I am certain that I can fit in the bit shifting of queues, mainly because the shift operation in the ARM is free. So you shouldn't need to wait a couple of scanlines; it should be available at the next read.

 

I started work on a bootloader yesterday, so updating the ARM will be as easy as copying a hex file to the SD. The bootloader is proving to be a bit tricky, as most are, but in any case, Philips offers a very simple tool that loads code onto all of their ARMs through the serial port.

 

As far as documentation goes, I have no plans to start that anytime soon, but things will be fully documented.

 

Vern


All of these are possible. I guess I don't understand why you only want to use a single queue for everything.

Atari 2600 programmers are all heavily conditioned to writing very optimized code (CPU time and space). So we probably just feel pretty uncomfortable when wasting that much space. At least I do. To me it feels like a brute-force attack to overcome the 2600 limitations, not like an elegant way of extending them. So I pretty much just instinctively try to optimize as much as possible.

 

But that's probably just the result of too much Atari 2600 programming. :)

 

Use 50 queues for 50 sprites, then you don't ever have to erase data and write it back in. I don't see filling queues at power on as an issue. But if that's the case then you can just have them loaded for you by the ARM from any file, including your source.

How long would loading from a file take?

 

I don't have a problem with any of these suggestions, I am just trying to get a better understanding. I am mostly ignorant of VCS programming, so please be patient.

I am even more ignorant of what you are doing now, so please be even more patient. ;)


But you're not actually using code space for queue storage. You can't put code in the queue space, so you aren't wasting space. You have 64K for your immediate code use, and infinite amounts of code can be swapped in over that 64K. The other 64K is off limits anyway. That's only for queue storage.

Ok, then I am wasting queue space. :)

 

So I am limited to 256 single-colored sprites (probably less), right? Or 128 multi-colored ones. Minus the queues needed for the playfield graphics and colors. Minus more queues, e.g. for enabling missiles or the ball, sizing and maybe even moving them (e.g. the zombies in Glenn's Death Derby WIP). Hm, might get pretty tight. :ponder:

 

I wonder how many sprites were used in Reindeer Rescue. There were quite a lot, and many with some animations. Though only a few multi-colored ones (but mainly due to limitations the Chimera should help to overcome).


Ok, then I am wasting queue space. :)

Space can only be wasted if it can otherwise be used for something else. The cart has a ton of memory put aside just to be 'wasted' for queues. If you don't use queues at all, then you are wasting 68K of memory.

 

So I am limited to 256 single-colored sprites (probably less), right? Or 128 multi-colored ones. Minus the queues needed for the playfield graphics and colors. Minus more queues, e.g. for enabling missiles or the ball, sizing and maybe even moving them (e.g. the zombies in Glenn's Death Derby WIP). Hm, might get pretty tight. :ponder:

Are there VCS games that have more than 256 sprites? That seems like an awful lot to me. If there is a game that uses 256 sprites and doesn't use this cart, then the same game should be able to run a whole lot more sprites with the cart.


How long would loading from a file take?

Disk access, no matter what, takes forever in CPU time. I haven't measured the throughput of my SD FAT filesystem, so I can't be sure. But there is no reason that a game needs to stop executing during a load. The VCS still has full access to its code and banking during that time.

 

But if you needed speed we could do a direct load from the running code to the queues if that would help. That should be pretty fast. Not a couple cycles fast, but fast.

 

Vern


To try to put the issue of memory waste into perspective, game code can use up to 64K at a time, maybe a little less depending on how we implement the menuing system. Right now we're planning on game-driven multiloads so the menu app probably doesn't have to stay in the 64K space all the time.

 

So there is another 64K dedicated to SRAM queues. (Game code cannot execute from there.) There is only 4K on top of that dedicated to fast queues. So while the SRAM queues probably can't have as many special features as the fast queues, you can make up for it by "wasting" space.

 

If you had a sprite with 20 frames of animation, you might need 40 queues (20 for the graphics, 20 for the color). Yes, each queue would be mostly blank space. Each frame you would figure out which pair of queues to use, and seek them so that the sprite will show up in the proper scanline range. The kernel itself would be extremely simple. It would look something like this:

 

LDX animation_frame    ; X = which animation frame, i.e. which queue pair to use
LDA QUEUE_0,X          ; next shape byte from graphics queue X
STA GRP0
LDA QUEUE_20,X         ; next color byte from the matching color queue
STA COLUP0
STA WSYNC

 

This is not a very realistic kernel because it's just showing one sprite, so the speed advantages are not apparent. If you wanted to remove the relative addressing, you could poke the queue address into the kernel during VBLANK (self-modifying code).

 

One thing this does do, however, is make it extremely simple for the novice to get a sprite up and running. Many games could be created by just using a few basic boilerplate kernels and doing all the work by manipulating the queues.

 

Where this starts to break down is if you want to reuse a sprite in a kernel. Still, I can think of ways to continue using "dumb" kernels. Imagine a 3rd queue where you store the queue ID offset for every scanline. Now you can program in which queue gets read on every scanline.

 

LDX QUEUE_40; queue picker
LDA QUEUE_0,X
STA GRP0
LDA QUEUE_20,X
STA COLUP0
STA WSYNC

 

This is all done with no conditional branching whatsoever. This is not taking into account horizontally repositioning the new instance of the sprite. That will require knowing when to do that operation, but that flag can also be stored in a queue:

 

LDA QUEUE_60; reposition flag
BNE do_reposition
LDX QUEUE_40; queue picker
LDA QUEUE_0,X
STA GRP0
LDA QUEUE_20,X
STA COLUP0
STA WSYNC
rts

do_reposition
...
rts

 

So what this amounts to is unrolling conditional logic into queues. It takes a drastic change of programming philosophy because it's so wasteful of space, but when you're talking about so much space, it may not be that limiting.

 

With the addition of block-move commands, obviously, everything could be much more efficient. Portions of the 64K SRAM queue area could be reserved as a "data bank" of graphics information rather than queues per se. Then you'd have a small number of queues (maybe a dozen) that you would actually use in the kernel. Rather than switching between queues and adjusting the start indexes, you would tell the ARM to copy the data into the queues on each frame. It's just that with larger datasets, the ARM may take too much VBLANK time performing these operations against SRAM. ARM RAM is a different story, and since you have 16 queues to work with, you could tell the ARM to copy SRAM data into ARM RAM queues.
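
For what it's worth, the ARM-side work for one of those per-frame copies would be tiny. A rough pseudo-C sketch (all names invented): a run of bytes is pulled out of the SRAM "data bank" and dropped into one of the 16 fast queues.

#include <stdint.h>
#include <string.h>

#define FAST_QUEUES 16
#define QUEUE_LEN   256

static uint8_t sram[64 * 1024];                        /* the 64K SRAM queue/data-bank area */
static uint8_t fast_queue[FAST_QUEUES][QUEUE_LEN];     /* the queues living in ARM RAM      */

void blit_to_fast_queue(uint8_t q, uint8_t dest_index,
                        uint16_t sram_addr, uint16_t count)
{
    if ((uint16_t)dest_index + count > QUEUE_LEN)      /* clamp a bad request to the queue end */
        count = QUEUE_LEN - dest_index;
    memcpy(&fast_queue[q][dest_index], &sram[sram_addr], count);
}

The cost that matters is not this loop; it's how fast the ARM can actually read the SRAM, which is the VBLANK-budget concern above.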

 

I think what Delicon was suggesting as an alternative is you would read from a queue, but the actual queue data could come from some other physical place. This avoids having to block-copy data around, self-modify code, or use relative addressing.

 

What I'm trying to demonstrate, though, is that there are many ways to use queues, and many ways to accomplish the same basic task of sprite movement and animation.

 

Also remember that queues are not solely useful for regular sprites. I was envisioning the SRAM queues as being more useful for scrolling playfields or storing large digital audio. Using the seek function to scroll makes more sense when you use up more of the queue itself with data as you could with taller sprites or playfield graphics. For instance, it would be ideal for something like a looping/scrolling barrier that you can shoot at like in a Gunfight game.

 

One of the things we envisioned was how to implement queues larger than 256 bytes. We were thinking of hardcoding larger queues at certain hotspot addresses sort of like this:

 

0 0 1 2 3 4
1 _ 1 _ _ _
2 _ _ 2 _ _
3 _ _ _ 3 _

 

Vertically, the number represents the queue hotspot ID. Horizontally represents the underlying memory allocation. Queue0 would be 1K in size, and overlap the queue area of queues 1-3. The advantage of this arrangement is you would still be able to tell queue 0 to seek anywhere. Queue0's seek hotspots would be "linked" with all the other queues it overlaps. When you seek Queue1, you are also telling Queue0 to seek to that portion of Queue1. This way the seek hotspots still only require a single write, but you can randomly access queues larger than 256 bytes.
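
As a sketch of what "linked" seeks could mean in practice (pseudo-C, layout and names invented): queue 0 owns the whole 1K block, queues 1-3 each own a 256-byte slice of it, and seeking a slice drags queue 0's pointer along with it.

#include <stdint.h>

static uint8_t  big_block[1024];   /* the 1K of SRAM shared by queues 0-3          */
static uint16_t seek_ptr[4];       /* current read position of each queue hotspot, */
                                   /* expressed as an offset into big_block        */

void seek_queue(uint8_t queue_id, uint8_t index)
{
    if (queue_id == 0) {
        seek_ptr[0] = index;                            /* queue 0: seek within its first 256 bytes */
    } else {
        seek_ptr[queue_id] = queue_id * 256u + index;   /* queues 1-3: seek within their own slice  */
        seek_ptr[0] = seek_ptr[queue_id];               /* linked: queue 0 follows the same spot    */
    }
}

So a single write still does the seek, but by going through queue 1, 2 or 3's hotspot you can land queue 0 anywhere in its 1K.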

 

So I can see how SRAM queues may find a niche with vertically scrolling games where you would use some of the reserved queue IDs that are larger than 256 bytes, maybe several 4K or 8K queues, and you could have games that are more like Xevious than River Raid quite easily.

 

Remember that if you find that you are running out of space, if your game has discrete levels, you just tell the ARM to overwrite the queues with new data loaded from flash between levels. And since the load mechanism has no fixed size, you could pack your data tightly on disk. But it does seem like a no-brainer to support block moves from memory to memory.

 

Comments?


I don't understand the comments about writing to the queues once on poweron and then never worrying about them again - what if the sprites move vertically?

 

And if you reuse a sprite multiple times during the kernel, then they need to move vertically independently from each other.

 

So I don't think there is any way to avoid having to erase and rewrite sprite data into the queues; probably every sprite on every frame.

 

So I'd say block moves from memory to queue are almost a necessity.

 

And, just for reference: Reindeer Rescue was 16K and had:

-10 animation frames for Santa, with a separate color table for each frame *for each level* (4 levels) = 50 queues

-16 single-frame, single-color sprites that appeared on the floor

-13 multiple-frame, single-color sprites (100 frames total!) that appeared in the sky

-a fullscreen scrolling PF (would use 6 queues)

 

So, using a single queue for each sprite and the playfield would have required 172 queues alone :o And that's only for a 16K game!

 

If you want to see something really ambitious out of the Chimera I think you need to support block moves. ;)


I think ultimately people should visualize the queues as a form of literal video memory organized into columnar bitplanes. That is how the fullscreen modes that the ARM supports are going to work.

 

So you might have one bitplane for shape data, one for color data. With the sprites, you have shape, color, size, and horizontal movement. So you'd start to move away from drawing to the display via a customized kernel, to manipulating this bitmap memory, as you would on the Atari 8-bit home computers. For simpler games, it will allow developers to get a reasonably colorful display up and running quickly.

 

The maximum number of queues you would likely use in a given frame would be the total number of graphics registers in the console. You'd never reach that maximum on any one scanline because there isn't enough time, even with the benefit of queues. So the sheer number of queues we're supplying is MASSIVE OVERKILL. This wastes memory that we were obligated to include on the cartridge anyway by virtue of SRAM memory pricing, and it has virtually no impact on the amount of space available for mainline 6507 assembly. This has big advantages over, for instance, a Supercharger-like RAM bitmap coexisting in the cartridge address-space. And since I've already seen Supercharger bitmaps do wonderful things as it is, this should be a big net gain regardless of implementation.

 

In theory I think we should offer special queue features for the fast RAM, but with every new way to customize how the queues behave, you need to go beyond theory and establish a specific API to configure the queues. It quickly becomes too complicated to implement by reserving special hotspots. The more hotspots you have the more it eats into your address-space. (As it is we're already working on a hotspot toggler to "hide" the queues so you can get at the underlying memory.) We would have to have a hotspot "command string window" and have the ARM decode these. And to configure a single queue via multi-byte commands with a lot of options is going to be really slow with Supercharger writes. So for just determining when to display a single sprite, it's a lot more CPU efficient to just seek the queue and reserve padding. To make it even faster we're working on "seek groups" so if you chain different attributes of a sprite (color, shape, width, hmov) then you can seek the group with a single write operation. This would also work well to vertically scroll background graphics.
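
A seek group doesn't need to be anything fancier than a list of queue IDs that all receive the same index from one hotspot write. Rough pseudo-C sketch, with all names invented:

#include <stdint.h>

#define MAX_GROUP 8

typedef struct {
    uint8_t members[MAX_GROUP];   /* queue IDs chained into this group (shape, color, size, HMOV) */
    uint8_t count;
} seek_group_t;

static uint16_t queue_index[256 + 16];   /* current position of every queue */

static void seek_queue(uint8_t q, uint8_t index) { queue_index[q] = index; }

/* One 6507 write to the group's hotspot seeks every chained queue at once. */
void seek_group(const seek_group_t *g, uint8_t index)
{
    for (uint8_t i = 0; i < g->count; i++)
        seek_queue(g->members[i], index);
}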


I don't understand the comments about writing to the queues once on poweron and then never worrying about them again - what if the sprites move vertically?

 

For a 200-scanline game, your sprite graphics live between scanlines 201-208. The rest of the queue is blank. If your queue index is initialized at 0, the sprite will be just offscreen at the bottom. To move up vertically, you seek the queue forward. If you need multiple queues (for other synchronized sprite priorities) then you just seek those queues also (probably with the group-seek feature).
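
In other words, the seek index is the vertical position. Assuming the layout just described (sprite rows parked at positions 201-208 of the queue, and the kernel reading one byte per scanline starting at the seek index), the arithmetic is just this throwaway sketch:

#include <stdint.h>

#define SPRITE_DATA_START 201   /* where the 8 rows of graphics sit in the queue */

/* seek value that puts the sprite's top row on a given visible scanline (0-199) */
uint8_t seek_for_scanline(uint8_t top_scanline)
{
    return (uint8_t)(SPRITE_DATA_START - top_scanline);
}
/* seek_for_scanline(0)   -> 201 : sprite at the very top of the screen
   seek_for_scanline(192) -> 9   : sprite flush against the bottom
   seek of 0                     : sprite just off the bottom, as described above */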

 

In the lame demo video I posted, the scrolling is simply a matter of incrementing these indexes between frames. It does not require that the data be rewritten into the queue at a different position.

 

I agree that this technique is not suitable for all cases, but it is still useful.

 

This is all assuming you seek queues only during VBLANK. If you start trying to seek them mid-screen then you can conserve more memory or do more interesting things. Depending on the latency of queue seeks, in an Oystron-type game where there is vertical separation, the appropriate queue and start index could be chosen while the sprites are being repositioned. In a kernel like that, the full height of the queue never gets used anyway, so there wouldn't need to be as much padding and you could pack more shapes onto individual queues. It would work well for scores also. You could store 0-9 in six queues like a slot machine and quickly seek to the right position for each queue. Or in a split-screen game like Motorodeo or Xenophobe, you reuse the same queues, and just do all your queue seeking in the middle of the screen. Another thing you can do is decouple one queue from another. If you define a rainbow pattern in one queue and graphics in another, you can scroll the graphics in one direction and the rainbow in the other just by seeking. No overwriting of queue data required.
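
The score case is a nice example of how cheap this gets. A pseudo-C sketch (names and the 8-line digit height are assumptions): each of the six digit queues holds the glyphs 0-9 stacked like a reel, and showing a digit is one seek per queue.

#include <stdint.h>

#define GLYPH_HEIGHT 8           /* assumed digit height in scanlines */

static uint16_t queue_index[256 + 16];
static void seek_queue(uint8_t q, uint16_t index) { queue_index[q] = index; }

/* Seek the six digit queues so the kernel's next 8 reads of each one return
   the right glyph. Offsets are relative to where the score kernel starts
   reading, i.e. this would be done mid-screen just before the score rows. */
void show_score(const uint8_t digits[6], uint8_t first_score_queue)
{
    for (uint8_t i = 0; i < 6; i++)
        seek_queue(first_score_queue + i, digits[i] * GLYPH_HEIGHT);
}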

Edited by mos6507

So, using a single queue for each sprite and the playfield would have required 172 queues alone :o And that's only for a 16K game!

Thanks for the numbers. And you probably would have needed a lot more queues than just 6 for the scrolling playfield. If that part would have been doable with the current queue design at all.

 

If you want to see something really ambitious out of the Chimera I think you need to support block moves. ;)

Yup. At least if the queues remain designed as they are now (I am not giving up soon ;)).

 

And while it might be nice that coding the kernel gets easier with Chimera, this is not what ambitious homebrewers are looking for. We want new opportunities; we want to create games which were completely impossible without it.


To try to put the issue of memory waste into perspective, game code can use up to 64K at a time [...] If you had a sprite with 20 frames of animation, you might need 40 queues (20 for the graphics, 20 for the color). Yes, each queue would be mostly blank space. Each frame you would figure out which pair of queues to use, and seek them so that the sprite will show up in the proper scanline range. The kernel itself would be extremely simple. It would look something like this:

LDX animation_frame
LDA QUEUE_0,X
STA GRP0
LDA QUEUE_20,X
STA COLUP0
STA WSYNC

Yup, that's what the code would look like. But since registers are extremely scarce in the 650x, it would be better not to require loading X.

 

One thing this does do, however, is make it extremely simple for the novice to get a sprite up and running. Many games could be created by just using a few basic boilerplate kernels and doing all the work by manipulating the queues.

Which is IMO pretty irrelevant for the target group of the product.

 

So what this amounts to is unrolling conditional logic into queues. It takes a drastic change of programming philosophy because it's so wasteful of space, but when you're talking about so much space, it may not be that limiting.

Understood.

 

It's just that with larger datasets, the ARM may take too much VBLANK time performing these operations against SRAM. ARM RAM is a different story, and since you have 16 queues to work with, you could tell the ARM to copy SRAM data into ARM RAM queues.

16 queues would be enough, if they are efficient.

 

BTW: Instead of copying, flagging like the DPC would work here too. Or queue combining, either by ANDing two queues (like I described above) or by letting one queue control the iteration of the other one. E.g. the controlling queue would contain 0, 0, 0,...0, 1, 1, 1..., 1, 0, 0, 0..., where 0 means do not iterate the controlled queue.
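
The controlled-iteration idea could be sketched like this (pseudo-C, purely illustrative): every read returns the controlled queue's current byte, and the controlling queue's byte decides whether that position advances.

#include <stdint.h>

typedef struct {
    const uint8_t *data;   /* 256 bytes of queue contents            */
    uint8_t index;         /* current position; wraps at 256 anyway  */
} queue_t;

uint8_t read_controlled(queue_t *controlled, queue_t *control)
{
    uint8_t value = controlled->data[controlled->index];
    uint8_t step  = control->data[control->index++];   /* controlling queue always advances */
    controlled->index += step;                          /* 0 = hold this byte, 1 = move on   */
    return value;
}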

 

I was envisioning the SRAM queues as being more useful for scrolling playfields...

But exactly there, somewhat intelligent queues are a must. With completely stupid queues, one 200-line screen, using just PF1 and PF2, requires 800 bytes (4 queues). For each additional horizontal screen, you need another 800 bytes. And if you add vertical scrolling, you soon need more than 1 queue per column. A scenery 10x10 screens large requires 8 vertical queues by 40 horizontal queues. That's 320 queues! And if the scenery is not 100% static, it gets even worse. Therefore even with so much memory available, we need intelligent queues.

 

Remember that if you find that you are running out of space, if your game has discrete levels, you just tell the ARM to overwrite the queues with new data loaded from flash between levels. And since the load mechanism has no fixed size, you could pack your data tightly on disk.

Sure.


In a kernel like that, the full height of the queue never gets used anyway, so there wouldn't need to be as much padding and you could pack more shapes onto individual queues.

Didn't think about that yet. So you could use the memory more efficiently then.

 

But the inflexible handling problems would still remain. Since I couldn't put all sprites into one queue, I would still have to switch queues inside the kernel. Either by indexing queues, like you described above, or (better!) by assigning a different queue to the same hotspot, as Vern described.

 

But then why not make the final step and allow a queue to point at any address inside the 64K area? Even queues limited to e.g. a 4K area would work; then the unused 4 bits could be used for control commands.
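
That works out neatly because a 4K window only needs 12 bits of pointer, so a single 16-bit value per queue could carry both, e.g. (bit layout invented for illustration):

#include <stdint.h>

#define OFFSET_MASK 0x0FFFu   /* low 12 bits: position within the 4K window */
#define CTRL_SHIFT  12        /* top 4 bits: control commands               */

static inline uint16_t queue_offset(uint16_t word) { return word & OFFSET_MASK; }
static inline uint8_t  queue_ctrl(uint16_t word)   { return (uint8_t)(word >> CTRL_SHIFT); }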


Which is IMO pretty irrelevant for the target group of the product.

 

There is more than one target group, but to address your concern...

 

Right now we are evaluating a better ARM chip that has more RAM on board (over 30K). In addition to that, Delicon says he would be able to double the access speed to the SRAM. That would probably satisfy what I can see are growing expectations. The only thing we really need from that chip is the RAM. It has a lot of other junk on there that is totally overkill, but there is nothing in between. Consequently, it would boost the cost of the cart by over $2 net, assuming we eliminate the dedicated flash chip and use the 512K flash for both ARM firmware and embedded games (thus constraining the maximum size of embedded homebrews). It is not pin-compatible with the 2103 so we would not be able to fully cost-reduce it for something like a run of Leprechaun carts. The choice of the ARM chip is probably the biggest controllable factor in the cost of the cart. I didn't want to have different levels of Chimera in the field. I liked the idea that anyone who just bought a homebrew cart thinking it's just like any other cart (for relatively the same price) winds up getting a fully functional Chimera in the process (with the built-in game initially write-protected). I was hoping to make 200-300 of these things all at once so I wanted them to appeal to end-users.

 

So the original idea was for this to be a Supercharger replacement cart first and the extra stuff would be a lucky stroke of serendipity (which we would exploit to the fullest of course). The current board has a lot of bells and whistles like the EEPROM and clock cart that are not really essential. Unfortunately these are also not that expensive so they don't save you a heck of a lot by leaving them out. The original cart wasn't even going to have removable storage. The only features beyond the core Supercharger that I considered to be absolute requirements were the dual serial ports and the built-in flash. The serial port had to be there to load games anyway, and the ARM has two UARTs so it was foolish not to expose both (and allow the VCS to use them effectively).

 

We're really at a crossroads because we know we have to do at least one more board revision, and it really should be the final one. The next board revision is going to enable the two expansion ports to be configured as 2 extra Atari-compatible controller ports. It looks like there is going to be a 3-4 week turnaround on the new board design, so I was hoping to lock down the ARM chip soon. I'm just not sure we have enough information yet to know whether using the LPC2103 is going to be a regretful decision or not.

 

I don't know how many extra 9.2 boards Delicon has, but if you want an evaluation board to experiment on, please let us know. That way you can use these features for yourself and report back whether it actually is too constraining or not.

 

The SRAM access speed is always going to be the weakest link in the system. Delicon has done a great job to make the most of it, though. I am hopeful that most of the kinds of things you want to do with queues will be possible with the fast queues one way or another. Like I said before, what needs to happen is to distill these ideas into an API. If you want flagging or masking or some other feature, how would you like the VCS to "program" the queues given that you have 256+16 of them (far more than the DPC)? What kind of data structures in memory would the ARM be required to store these settings? Every extra setting may consume an extra byte of ARM RAM for every queue, or if stored in SRAM for SRAM queues, slow them down. Would these settings span a range of queues instead? Technical challenges require creative solutions, both from the software and hardware angles. So let's brainstorm.


BTW: Instead of copying, flagging like the DPC would work here too. Or queue combining, either by ANDing two queues (like I described above) or by letting one queue control the iteration of the other one. E.g. the controlling queue would contain 0, 0, 0,...0, 1, 1, 1..., 1, 0, 0, 0..., where 0 means do not iterate the controlled queue.

 

OK, how about we go even higher-level. At design time you build your data chunks representing your graphics (including all associated properties, not just the shape). You load in these objects and "register" them with the ARM. Then you just tell the ARM that for a given display you want object 0, 1, 2, and 3 and you want them here, there, there, and there. The ARM populates the queues for you, and even does the flickersorting on overlap conditions. At that point, almost all the work involved in the kernel is being done inside the ARM. The ARM should have enough time during VBLANK to do all that work. It's not that much worse than the text mode stuff we intend to do. The only limiting factor is how much graphics you can pack into the ARM RAM.
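
As a sketch of what that interface might look like from the ARM side (pseudo-C; none of these names exist, and the interesting work in build_frame is deliberately stubbed out):

#include <stdint.h>

typedef struct {
    const uint8_t *shape;   /* graphics rows                  */
    const uint8_t *color;   /* one color per row              */
    uint8_t height;
} object_def_t;

typedef struct { uint8_t id, x, y; } placement_t;

static object_def_t objects[256];   /* design-time registry, indexed by object id */
static placement_t  wanted[16];     /* what the 6507 asked for this frame         */
static uint8_t      wanted_count;

void register_object(uint8_t id, const object_def_t *def) { objects[id] = *def; }

void place_object(uint8_t id, uint8_t x, uint8_t y)
{
    if (wanted_count < 16)
        wanted[wanted_count++] = (placement_t){ id, x, y };
}

/* During VBLANK the ARM would sort 'wanted' by y, resolve flicker where
   objects overlap, and write the shape/color bytes into the queues. That
   work is deliberately left out here - hiding it is the whole point. */
void build_frame(void) { wanted_count = 0; }

The 6507 ends up writing a handful of bytes per frame, object IDs and positions, and everything else happens during VBLANK.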

Edited by mos6507

OK, how about we go even higher-level. At design time you build your data chunks representing your graphics (including all associated properties, not just the shape). You load in these objects and "register" them with the ARM. Then you just tell the ARM that for a given display you want object 0, 1, 2, and 3 and you want them here, there, there, and there. The ARM populates the queues for you, and even does the flickersorting on overlap conditions. At that point, almost all the work involved in the kernel is being done inside the ARM. The ARM should have enough time during VBLANK to do all that work. It's not that much worse than the text mode stuff we intend to do. The only limiting factor is how much graphics you can pack into the ARM RAM.

Hm, that sounds very tempting. But we would probably lose a lot of flexibility that way. And flexibility was one major key to the 2600's big success.

 

And then programming the Chimera would become pretty boring. ;)


If you want flagging or masking or some other feature, how would you like the VCS to "program" the queues given that you have 256+16 of them (far more than the DPC)?

Either by one hotspot for each function and parameter (which would block a lot of ROM space). Or three generic hotspots which select the queue, the parameter type and then the parameter. This would be three writes for each parameter and 6 hotspots (one set for fast and one for slow queues). Instead of 3 hotspots, maybe just one would work, if the ARM can interpret the input queue.

 

Both options have their advantages and disadvantages. Maybe a compromise would be the best solution. E.g. controlling the fast queues with dedicated hotspots for each queue, but a set of generic hotspots for the other queues.

 

:idea: Or maybe just putting an 8- or 16-bit address into the ARM, which points to a set of parameters. This would probably be a good solution as you only have to write one or two bytes. Maybe you could even provide a parameter queue, so when it gets triggered, it goes to the next set of parameters. This would only require a hotspot read (which could reside in the ZP) and just take 4 (or 3) cycles.

 

Or...? :ponder:

 

What kind of data structures in memory would the ARM be required to store these settings?

For one queue, we would need the address it points to (up to 16 bits), then one or two control bytes, which define how the queue works (flagging, logical operations, iterations, queue combining) and finally the parameter bytes.
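
Put into a rough structure (pseudo-C, field names and sizes are guesses for illustration), that per-queue record might be:

#include <stdint.h>

typedef struct {
    uint16_t address;      /* where in the 64K area the queue points (up to 16 bits)           */
    uint8_t  control[2];   /* how the queue works: flagging, logical ops, iteration, combining */
    uint8_t  params[4];    /* parameter bytes, e.g. partner queue, mask value, step size       */
} queue_config_t;

static queue_config_t queue_config[256 + 16];   /* one record per SRAM queue plus fast queue */

At 8 bytes per record, a full table for all the queues is a couple of kilobytes, which is exactly the per-queue ARM RAM cost mos6507 warns about above.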

 

Every extra setting may consume an extra byte of ARM RAM for every queue, or if stored in SRAM for SRAM queues, slow them down. Would these settings span a range of queues instead?

You mean, if one intelligent queue replaces a lot of dumb ones? Definitely! IMO.

 

Not all queues have to be intelligent though, but each of them should be controlled independently.

 

Technical challenges require creative solutions, both from the software and hardware angles. So let's brainstorm.

I thought we were doing that right now. :)

 

And I hope some more homebrewers will voice their requirements and/or opinions here.

Edited by Thomas Jentzsch

Just occurred to me - using the queues for fullscreen PF (horizontal) scrolling is absolutely NOT going to work, because:

 

PF0 only uses the top 4 bits of the register - so when you scroll into or out of PF0 you need to scroll into the 4th bit in one case (into the 7th bit in the other, which is no big deal) and you need to scroll the 3rd bit out (NOT the carry!) in one case.

 

Not to mention the fact that unless you can bitshift an entire queue into another that it would be almost impossible to use the queues for a scrolling PF display.

 

Actually, just to illustrate this, here's what has to happen to horizontally scroll a fullscreen PF (like in Reindeer Rescue):

   ldx #PLAYFIELDHEIGHT
Loop  
  rol queue7,X;feeder data - not displayed
  rol queue6,X
  ror queue5,X
  bcc NoBitToRollIn
;--roll in a bit
  lda queue4,X
  ora #%00001000
  bne RollPF4;PF4 = PF0 on the right side
NoBitToRollIn
  lda queue4,X
  and #%11110000
RollPF4
  sta queue4,X
  rol queue4,X
  ror queue3,X
  rol queue2,X
  ror queue1,X
  dex
  bpl Loop

And every 8 scrolls you have to completely refill queue7.

 

The scrolling in Reindeer Rescue, which used about 60 bytes of RAM, took over half of the overscan to execute, I think. It took a long time.

 

I think there should be 8 queues specifically dedicated to the PF. They would have these properties:

-They could be bitshifted at the queue level (without having to bitshift each value in the queue)

-They would bitshift into each other *like the playfield works* (for reflected and nonreflected?)

-The 1st and 8th queues would hold the data to be shifted in/out of the playfield on the left and right (and queues 2-7 would be the displayed playfield).

 

Is this at all possible? :D

 

EDIT: Or maybe better - being able to specify 8-queue blocks as "PF queues"

Because if you were going to be really fancy you might want to flicker the PF and so you'd need two complete sets of PF queues. Think of TJ's mockup here: http://www.atariage.com/forums/index.php?showtopic=38734

 

EDIT II: Completely unrelated thought regarding the necessity of block memory moves to the queues: if you wanted to play digitised music (like in Pitfall II), you would need to feed data into the queue at an *enormous* rate. You have to pull data and stuff it into AUDVx once per scanline, all the time - that's more than 250 bytes per frame! If you had to feed 250 bytes into the queue every frame, plus the computational time you need to stuff AUDVx every scanline (during the overscan, etc.) and you'd barely have time to do anything else at all! :)

Edited by vdub_bobby

One of the features that Delicon and I were discussing a while back was to just allow the game to bank all 128K. That way if you wanted to write to a queue, instead of using the seeks, you could bank in the 2K chunk it lives on and work with it in a traditional way, then bank it back.

 

We were also looking at a way to reconfigure the hotspots so you could open up a 256 byte 'window' to ARM RAM.

 

This would simplify the process of updating queues, and maybe provide a better way to message-pass with the ARM vs a stack-like approach.

 

So after seeing what lots of memory can do with playfields without Chimera in something like Superbug, I think a hybrid approach where both the ARM and the 6507 contribute to solving the problem might work well.

 

----

 

Audio is a whole other discussion. With Delicon working on the PWMs, we may be able to offer something comparable to Pitfall II. No matter how you slice it, however, feeding AUDVx on every scanline is going to cramp your style.

 

I was thinking, however, that the queues could be used to store piano-roll type tracks that play back similar to Paul Slocum's music driver. That way you would "hide" the entire musical score behind a small number of queue hotspots. These would only have to be read once a frame instead of every scanline.


Not to mention the fact that unless you can bitshift an entire queue into another that it would be almost impossible to use the queues for a scrolling PF display.

 

Rather than merely having different hotspots for the different queues, I was thinking it might be useful to have a 'queue list' queue. Successive reads from the main queue hotspot would grab data from the different queues in the required sequence. This approach might also make it practical to better configure attributes for the various queues.

 

To allow side-scrolling, I'd allow queues to be configured with a few fun modes:

  1. Bit-grab mode: A read from the queue will put a specified bit (0-7) into a special "carry" flag. To avoid wasting time, the read could return something other than the queue data to the 6502 (e.g. the output of the audio generator).
  2. Left full mode: The value read from the queue will be shifted left a bit before being put on the data bus, with bit 0 supplied from the "carry" flag, and with bit 7 going into the carry flag. Like the ROL instruction.
  3. Right full mode: The value read from the queue will be shifted right a bit, as with the ROR instruction.
  4. Left half mode: The value read from the queue will be shifted left, but the carry flag will go into bit 4.
  5. Right half mode: The value read from the queue will be shifted right, but carry will be taken from bit 4.

If the display is used in mirrored mode, the latter two options could be eliminated. Alternatively, if the cart can support magic writes, the processor could do "rol"/"ror" instructions itself. Not as fast, but it would reduce overhead.
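
In pseudo-C, a few of those read modes would behave roughly like this (one queue shown; the cart-side "carry" and every name here are made up to illustrate the data flow, not a real register map):

#include <stdint.h>

static uint8_t queue_data[256];   /* stand-in for one queue's contents      */
static uint8_t queue_pos;         /* its current read position              */
static uint8_t carry;             /* the cart-side "carry", not the 6502's  */

static uint8_t queue_next(void) { return queue_data[queue_pos++]; }

uint8_t read_bit_grab(uint8_t bit)   /* mode 1: capture one bit of the data */
{
    uint8_t v = queue_next();
    carry = (v >> bit) & 1;
    return 0;                        /* could return e.g. audio data instead */
}

uint8_t read_left_full(void)         /* mode 2, like ROL: carry -> bit 0, bit 7 -> carry */
{
    uint8_t v = queue_next();
    uint8_t out = (uint8_t)((v << 1) | carry);
    carry = v >> 7;
    return out;
}

uint8_t read_right_full(void)        /* mode 3, like ROR: carry -> bit 7, bit 0 -> carry */
{
    uint8_t v = queue_next();
    uint8_t out = (uint8_t)((v >> 1) | (carry << 7));
    carry = v & 1;
    return out;
}

uint8_t read_left_half(void)         /* mode 4, for PF0: carry enters at bit 4 */
{
    uint8_t v = queue_next();
    uint8_t out = (uint8_t)(((v << 1) & 0xE0) | (carry << 4));   /* only the top nibble matters */
    carry = v >> 7;
    return out;
}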

 

On a related note, especially if magic writes are supported, it may be useful to include a "copy" mode in which the "read" half of a cycle comes out of one queue and the "write" half goes into another. When scrolling into a new group of 8 pixels, the copy mode could be used to make the new data appear instantly from a "permanent" copy; it should then be shifted and copied into the temporary copy.

 

Unfortunately, this approach would be much better suited to left-right scrolling than vice versa, since it would require "upstream" data to be read first. There'd be some ways of dealing with that, but it would be somewhat nuisancesome.


Audio is a whole other discussion. With Delicon working on the PWMs, we may be able to offer something comparable to Pitfall II. No matter how you slice it, however, feeding AUDVx on every scanline is going to cramp your style.

 

Indeed so, although putting the data, plus one, on D0-D3 during a read of AUDV0 would help minimize the cramping since it would allow a "DEC" instruction to update AUDV0 in 5 cycles without trashing the accumulator.


No matter how you slice it, however, feeding AUDVx on every scanline is going to cramp your style.

I think every other scanline would still work reasonably well. A sampling rate of ~7.5kHz sounds fine to me as long as we don't try to represent high fundamental pitches or lower pitched timbres for which the harmonics decay slowly. But it will still make programming difficult. Title screen music shouldn't be a problem, but in-game music will be a pain.

 

Still, before I knew about the RIOT timers, I programmed a playable game that didn't use them at all, though I fixed it once I learned how the timers worked.

Edited by batari
