
Chimera Queues - Brainstorming discussion


mos6507


Still, before I knew about the RIOT timers, I programmed a playable game that didn't use them at all, though I fixed it once I learned how the timers worked.

 

Back in 1994, I gave up on the game that would later become Strat-O-Gems because I couldn't figure out the RIOT timer and thought it would be too miserable to try to count scan lines while searching for triples. A lot of the kernel is actually from 1994.


You like the concept? I wonder why nobody did it "back in the day"?

I think it's just lucky that it works. The 6507 keeps the data on the data bus for only a few ns after it changes the address bus for the next operation. It is only because of that tiny overlap that it can work. Unfortunately you can't time the writes; you have to snatch that data the moment the VCS alters the address bus.

 

Vern


I think it's just lucky that it works. The 6507 keeps the data on the data bus for only a few ns after it changes the address bus for the next operation. It is only because of that tiny overlap that it can work. Unfortunately you can't time the writes; you have to snatch that data the moment the VCS alters the address bus.

 

Actually, with a bus-hold circuit it's quite solid. The 6507 outputs data soon after the start of phi2, and doesn't change the address until after the end of phi2, so there's over 100ns of leeway.


I think every other scanline would still work reasonably well. A sampling rate of ~7.5kHz sounds fine to me as long as we don't try to represent high fundamental pitches or lower pitched timbres for which the harmonics decay slowly. But it will still make programming difficult. Title screen music shouldn't be a problem, but in-game music will be a pain.

I have a partial disassembly of Pitfall II and it shows that keeping the sampling rate constant is really a pain. Not only inside the kernel, where it "just" costs 7 cycles/scanline, but especially outside the kernel.

 

BTW: 7.5kHz would only allow frequencies of less than 4kHz.
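To put rough numbers on the rates being discussed, here is a small Python sketch. It assumes the NTSC color clock of 3.579545 MHz with 228 color clocks (76 CPU cycles) per scanline; the function names are made up for illustration. With those assumptions, sampling every other scanline comes out a little above the ~7.5kHz quoted above, and the usable band is half of that.

```python
NTSC_COLOR_CLOCK_HZ = 3_579_545    # NTSC color clock feeding the TIA
COLOR_CLOCKS_PER_SCANLINE = 228    # 160 visible + 68 horizontal blank
CPU_CYCLES_PER_SCANLINE = 76       # 228 color clocks / 3


def sample_rate_hz(scanlines_per_sample):
    """Sample rate when one audio sample is written every N scanlines."""
    return NTSC_COLOR_CLOCK_HZ / COLOR_CLOCKS_PER_SCANLINE / scanlines_per_sample


def nyquist_hz(rate_hz):
    """Highest frequency a given sample rate can represent."""
    return rate_hz / 2


def kernel_overhead(cycles_per_scanline):
    """Fraction of each scanline's CPU time spent on the audio update."""
    return cycles_per_scanline / CPU_CYCLES_PER_SCANLINE


if __name__ == "__main__":
    for n in (1, 2):
        rate = sample_rate_hz(n)
        print(f"every {n} scanline(s): {rate:.0f} Hz, Nyquist {nyquist_hz(rate):.0f} Hz")
    print(f"7 cycles/scanline = {kernel_overhead(7):.1%} of the line")
```

The 7 cycles/scanline figure is the one mentioned for Pitfall II; the overhead outside the kernel is harder to model because the update still has to land at a fixed interval.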


I think every other scanline would still work reasonably well. A sampling rate of ~7.5kHz sounds fine to me as long as we don't try to represent high fundamental pitches or lower pitched timbres for which the harmonics decay slowly. But it will still make programming difficult. Title screen music shouldn't be a problem, but in-game music will be a pain.

I have a partial disassembly of Pitfall II and it shows that keeping the sampling rate constant is really a pain. Not only inside the kernel, where it "just" costs 7 cycles/scanline, but especially outside the kernel.

 

BTW: 7.5kHz would only allow frequencies of less than 4kHz.

I know, but 4 kHz (3.75 to be precise) is quite high-pitched. Here's a sample:

 

ftp://ftp.tek.com/tv/test/streams/Element...dio/4kz-064.mp2

 

EDIT #2: I recall a demo being posted to the Stella list a few years ago that used 7.5 kHz and was 100% software driven. Was it pitsound.bin, or is it something else I'm thinking of?

Edited by batari

Actually, with a bus-hold circuit it's quite solid. The 6507 outputs data soon after the start of phi2, and doesn't change the address until after the end of phi2, so there's over 100ns of leeway.

I noticed, through the logic analyzer, that some write instructions take longer than others. This definitely happens in digdug. Sometimes a write instruction will hold for two VCS cycles, with the VCS applying its data near the end of the second cycle. Other times it's just a single cycle. But in both cases the VCS applies its data near the end of the last cycle. That's why I was saying you can't time the write from the beginning of the cycle. You must wait until the moment the VCS changes the address bus. The reason it is so difficult for me is the SRAM sharing: the VCS doesn't directly touch the SRAM. There is an arbitrator in between that handles requests from the VCS and ARM. It is not able to react fast enough at that last moment in the cycle to capture the data before it's gone. So the hard part was artificially maintaining the data bus so the write could complete. Anyway, that's probably more than you wanted to know; the short story is it works.

 

Vern


Sometimes a write instruction will hold for two VCS cycles, with the VCS applying its data near the end of the second cycle. Other times it's just a single cycle. But in both cases the VCS applies its data near the end of the last cycle. That's why I was saying you can't time the write from the beginning of the cycle.

 

In the 4A50 cart, a "STA $1E00,X" instruction will start by outputting address $1E00 on the bus. I'll read the appropriate address from RAM during the first half of the cycle and leave it on the bus through the second half. The 6507 won't perform a write at that point, but I'll write the bus value back into RAM anyway. Then on the next cycle the address will still be $1E00, so I'll read the appropriate location again and put it on the bus. This time the 6507 will output something on the bus, and when I write the bus contents back to RAM it will store the 6507's data.

 

When using queues, you'd have to be a little careful about how you handle pointer updates. Best bet would probably be to increment pointers when the address bus 'leaves' a queue location.
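One way to picture the "increment when the address bus leaves" rule is the Python model below. It is only a sketch: the class name, the single-address hotspot, and the wrap-around behaviour are all assumptions, not a description of the actual Chimera or 4A50 logic. The point it illustrates is that an indexed store can put the same address on the bus for two cycles in a row, so advancing once per cycle would skip an entry, while advancing on departure does not.

```python
class QueueHotspot:
    """Hypothetical queue that is read through a single hotspot address."""

    def __init__(self, address, data):
        self.address = address
        self.data = list(data)
        self.pointer = 0
        self._selected = False   # True while the bus is sitting on the hotspot

    def bus_cycle(self, address):
        """Feed one bus cycle; return the byte driven on the bus, or None."""
        if address == self.address:
            self._selected = True
            return self.data[self.pointer % len(self.data)]
        if self._selected:
            # The address bus has just left the hotspot: advance exactly once.
            self.pointer += 1
            self._selected = False
        return None


if __name__ == "__main__":
    queue = QueueHotspot(0x1E00, [10, 20, 30])
    # Two consecutive cycles on $1E00 followed by the next opcode fetch.
    for addr in (0x1000, 0x1001, 0x1002, 0x1E00, 0x1E00, 0x1003):
        print(hex(addr), queue.bus_cycle(addr))
    print("pointer:", queue.pointer)
```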


In the 4A50 cart, a "STA $1E00,X" instruction will start by outputting address $1E00 on the bus. I'll read the appropriate address from

This is the timing for an absolute address write, but the timings are not the same for others. I will post a screen cap from the analyzer for digdug. If you timed a write, based on an 840ns VCS cycle, to happen in the second half of the cycle, it would fail.

 

When using queues, you'd have to be a little careful about how you handle pointer updates. Best bet would probably be to increment pointers when the address bus 'leaves' a queue location.

This is exactly how I handle it, so I don't think anything will have to change. I haven't tested the writes with any of the special features yet to see if there are any unintentional consequences.

 

Vern


This is the timing for an absolute address write, but the timings are not the same for others. I will post a screen cap from the analyzer for digdug. If you timed a write, based on an 840ns VCS cycle, to happen in the second half of the cycle, it would fail.

 

I'd be interested in seeing that. Here's what I would expect for an STA $1E00 executed from address $1000 if the accumulator holds $AA and address $1E00 used to hold $55, and the following instruction is a LDA #$00.

-[PHI1]-[PHI2]-[PHI1]-[PHI2]-[PHI1]-[PHI2]-[PHI1]-[PHI2]-[PHI1]-[PHI2]-[PHI1]-[PHI2]
X[-- $1000 --]X[-- $1001 --]X[-- $1002 --]X[-- $1E00 --]X[-- $1003 --]X[-- $1004 --]  Address
XXX[8D.........]X[00.........]X[1E.........]X[55.]XX[AA..]X[A9.........]X[00.......]  Data
---OO------II----OO------II----OO------II----OO------II----OO------II----OO------II- Action

And here's what I'd expect for STA $1E00,X when X is $33.

-[PHI1]-[PHI2]-[PHI1]-[PHI2]-[PHI1]-[PHI2]-[PHI1]-[PHI2]-[PHI1]-[PHI2]-[PHI1]-[PHI2]
X[-- $1000 --]X[-- $1001 --]X[-- $1002 --]X[-- $1E33 ----------------]X[-- $1003 --]  Address
XXX[9D.........]X[00.........]X[1E.........]X[55...............]XX[AA..]X[A9.......]  Data
---OO------II----OO------II----OO------II----OO------II----OO------II----OO------II- Action

Address shows the address bus contents; data shows the data bus. Action shows what my cart is doing at any given time (O=data output; I=data input). For the page-crossing case, if $1C32 holds $99 and $1D32 holds $88, I'd expect

-[PHI1]-[PHI2]-[PHI1]-[PHI2]-[PHI1]-[PHI2]-[PHI1]-[PHI2]-[PHI1]-[PHI2]-[PHI1]-[PHI2]
X[-- $1000 --]X[-- $1001 --]X[-- $1002 --]X[-- $1C32 --]X[-- $1D32 --]X[-- $1003 --]  Address
XXX[9D.........]X[FF.........]X[1C.........]X[99........]XX[88.]XX[AA..]X[A9.......]  Data
---OO------II----OO------II----OO------II----OO------II----OO------II----OO------II- Action
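The address rows of the three diagrams above can also be generated mechanically. The Python sketch below follows the usual NMOS 6502 cycle breakdown as I understand it (three fetch cycles, then for the indexed form a read at the address formed before the high byte is corrected, then the write). The function names and the (address, 'R'/'W') tuple format are just illustrative choices.

```python
def sta_absolute(pc, target):
    """Bus cycles for STA abs: opcode, low byte, high byte, write."""
    return [(pc, "R"), (pc + 1, "R"), (pc + 2, "R"), (target, "W")]


def sta_absolute_x(pc, base, x):
    """Bus cycles for STA abs,X, including the extra read before the write."""
    low_sum = (base & 0x00FF) + x
    # Address as first driven: low byte already indexed, high byte not yet fixed.
    uncorrected = (base & 0xFF00) | (low_sum & 0x00FF)
    effective = (base + x) & 0xFFFF
    return [
        (pc, "R"),
        (pc + 1, "R"),
        (pc + 2, "R"),
        (uncorrected, "R"),   # same as effective unless a page is crossed
        (effective, "W"),
    ]


if __name__ == "__main__":
    cases = (
        sta_absolute(0x1000, 0x1E00),
        sta_absolute_x(0x1000, 0x1E00, 0x33),
        sta_absolute_x(0x1000, 0x1CFF, 0x33),   # page-crossing case
    )
    for cycles in cases:
        print(" ".join(f"${addr:04X}{kind}" for addr, kind in cycles))
```

In the no-crossing case the read and the write hit the same address, which is why the cart sees $1E33 for two cycles; in the page-crossing case the read lands on $1C32 and only the write reaches $1D32.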


I'd be interested in seeing that.

I will get this to you, John. I tried tonight, but due to a comedy of errors I ended up with a bunch of dog hair on my board and I seem to have smoked a couple of chips. While I was removing one of the chips to replace it, I ripped a trace off the board. I will make the switch to a new board tomorrow and I will capture that stuff for you.

 

Vern

Edited by Delicon

Congrats on implementing the magic writes. You guys won't realize how liberating it is to be able to write to the cart as if it were normal RAM until you try it for yourself.

 

One of the things that Delicon was discussing with me was the ways in which the hotspots could be customizable. He has wired all the usable addresses from $FF00-FFFF to trigger hotspots. But after the CPLD notifies it, the hotspot handling itself is purely a function of the ARM. That means we may not have to settle on a singular hotspot layout or behavior. The game itself might be able to set up the hotspot handlers. This would be one way to install custom ARM code. The original idea for custom code was that we would have an instruction queue dedicated to passing ARM copro function calls and you would call the function by its registered ID. You'd then spool out the result via an output queue or it would draw to queues for bitmaps. However, a different kind of custom piece that might be small enough to fit in ARM RAM would be these hotspot handlers. If some specialized way of handling queues is desired which is not built in, you could try writing it yourself. We're trying to limit queue features such that they do not require any padding between reads. But most games will load and then immediately store the value to a TIA register, providing natural 3 cycle padding. By making it custom, you would just hardcode all the configuration directly into the routine.
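As a rough picture of what "the game sets up the hotspot handlers" could mean, here is a Python model of a handler table. Everything in it is hypothetical: the class, the method names, and the idea that a handler is a plain callable are assumptions for illustration. Only the $FF00-$FFFF range comes from the description above, and the real design would also have to exclude whichever addresses in that range are not usable.

```python
HOTSPOT_FIRST = 0xFF00
HOTSPOT_LAST = 0xFFFF


class HotspotTable:
    """Hypothetical ARM-side dispatch table keyed by hotspot address."""

    def __init__(self):
        self._handlers = {}

    def install(self, address, handler):
        """Register a callable to run when the VCS touches `address`."""
        if not HOTSPOT_FIRST <= address <= HOTSPOT_LAST:
            raise ValueError(f"${address:04X} is outside the hotspot range")
        self._handlers[address] = handler

    def access(self, address):
        """Called when the CPLD reports an access; returns the handler result."""
        handler = self._handlers.get(address)
        return handler() if handler is not None else None


if __name__ == "__main__":
    table = HotspotTable()
    sprite_rows = iter([0x18, 0x3C, 0x7E])
    # A custom handler with its configuration hardcoded into the routine.
    table.install(0xFF10, lambda: next(sprite_rows))
    print([table.access(0xFF10) for _ in range(3)])
```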


OK, building on prior ideas, I think the only way to handle true x/y scrolling is you point your kernel to 6 fast queues (as a framebuffer) and you'd have extra RAM around the edges that prebuffers the incoming data. The data would either get transformed for playfield display as it gets added to the framebuffer or at read-time in the queue handler. If the queue handler can handle it, you'd want to have it get done there to reduce the VBLANK time necessary. The incoming data would automatically be retrieved from SRAM by the ARM based on a virtual X/Y location. So this would take advantage of the fact that only enough SRAM has to be read in order to process the perspective change from frame to frame. Since it pulls 8 bits at a time, the ARM could split its workload across frames on the horizontal buffering. It would have more work to do on the vertical. If the data changes while it's in the buffer, then the ARM also has to overwrite the SRAM on the way out. In order to avoid corruption, you'd want to essentially "double buffer" the buffers. One buffer would be the active buffer which the ARM is currently pulling bits from to go into the main display. The inactive buffer will be incomplete until the last moment and then toggled over. So that would consume 10 buffers. You'd still have a couple leftover for sprites, plus any unused SRAM ones.
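The "transformed for playfield display" step could look something like the Python sketch below. It assumes a non-reflected playfield with all six register values (PF0, PF1, PF2 for each half) fed per line, and it uses the TIA bit order: PF0 shows bits 4 to 7 left to right, PF1 shows bits 7 down to 0, and PF2 shows bits 0 up to 7. The map layout (one list of 0/1 pixels per row) and the function names are assumptions for illustration.

```python
def playfield_registers(pixels):
    """Convert 20 pixels (0/1, left to right) into (PF0, PF1, PF2) values."""
    if len(pixels) != 20:
        raise ValueError("a playfield half is 20 pixels wide")
    pf0 = sum(bit << (4 + i) for i, bit in enumerate(pixels[0:4]))    # D4..D7
    pf1 = sum(bit << (7 - i) for i, bit in enumerate(pixels[4:12]))   # D7..D0
    pf2 = sum(bit << i for i, bit in enumerate(pixels[12:20]))        # D0..D7
    return pf0, pf1, pf2


def scanline_registers(map_row, virtual_x):
    """Six register values for the 40-pixel window starting at virtual_x."""
    window = map_row[virtual_x:virtual_x + 40]
    return playfield_registers(window[:20]) + playfield_registers(window[20:])


if __name__ == "__main__":
    row = [0] * 60
    row[5] = 1
    for x in (0, 1, 5):
        print(x, [f"${value:02X}" for value in scanline_registers(row, x)])
```

Scrolling one pixel horizontally changes bits in up to six bytes per line, which is part of why doing the transform in the queue handler rather than during VBLANK is attractive.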

 

If we had 32K of ARM RAM then the entire scrolling playfield engine including the data could reside there including tiling, and building the bitmap each frame would be faster because it never touches the SRAM. But then it's much more removed from the VCS.

 

Please note I don't know how you'd be able to implement much in the way of color changes that worked aesthetically with an x/y scrolling game. I think the Pitfall II approach is probably the best compromise for the TIA where you screen flip x and scroll y, or maybe some combination of the two where you'd scroll for a while, then stop and screen flip as you enter a new color scheme.


Just wild brainstorming here...

 

How could the queues help out with displaying 12-char no flicker text (using a 4-pixel-wide font)?

 

The problem with this is you have to dynamically build your text in RAM since each sprite (byte) holds two characters instead of one. This takes a while, I think about 5 scanlines for a single line of text.

 

The basic routine looks like this (for 4x5 font):

  lda #0
  sta Temp1                 ;Temp1 = current line of the font (0-4)
GetTextDataOuterLoop
  ldy #11
GetTextDataInnerLoop
  lda (TextPointer),Y       ;Right character.  TextPointer points to table with "text" -
                            ;pointers to low byte of char-shape tables
  sta MiscPtr2              ;(assuming high byte of address is the same,
                            ;i.e., char-shape tables all in one page)
  dey
  lda (TextPointer),Y       ;Left character
  sta MiscPtr1
  sty Temp3                 ;save Y
  ldy Temp1
  lda (MiscPtr1),Y
  sta Temp2                 ;save left character data
  lda (MiscPtr2),Y          ;get right character data
  lsr
  lsr
  lsr
  lsr                       ;shift it into the low nybble
  ora Temp2                 ;OR with left character
  pha                       ;and push line of char data
  ldy Temp3                 ;restore Y
  dey
  bpl GetTextDataInnerLoop
  inc Temp1
  lda Temp1
  cmp #5
  bne GetTextDataOuterLoop

That can be optimized in many ways, but that's the basic idea. Can some kind of fancy queue help this out in any way? Just wishing...
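For anyone who finds the data flow easier to follow outside of assembly, here is a Python model of the same packing step. It is a sketch under a few assumptions: the font is a dict mapping a character code to five bytes with the shape in the upper nybble, the output is built left to right (the routine above walks the text backwards and pushes onto the stack), and the function name is invented.

```python
FONT_HEIGHT = 5   # 4x5 font, one byte per line, shape in the upper nybble


def build_text_row(char_codes, font):
    """Pack pairs of 4-pixel characters into sprite bytes, line by line.

    Returns FONT_HEIGHT lists, each with one byte per character pair.
    """
    if len(char_codes) % 2:
        raise ValueError("need an even number of characters")
    lines = []
    for line in range(FONT_HEIGHT):
        packed = []
        for i in range(0, len(char_codes), 2):
            left = font[char_codes[i]][line]             # already in upper nybble
            right = font[char_codes[i + 1]][line] >> 4   # the four LSRs
            packed.append(left | right)
        lines.append(packed)
    return lines


if __name__ == "__main__":
    demo_font = {
        "A": [0x40, 0xA0, 0xE0, 0xA0, 0xA0],
        "B": [0xC0, 0xA0, 0xC0, 0xA0, 0xC0],
    }
    for packed in build_text_row("ABABABABABAB", demo_font):
        print(" ".join(f"{byte:08b}" for byte in packed))
```

Twelve characters give six bytes per line and thirty bytes per text row, which is the work a queue feature would ideally take off the 6507.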

Edited by vdub_bobby

If only there was a way to OR two queues together with a single read. The routine I used has the char-shapes in 2 pages, one page having the characters in the lower nybble, the other holding them in the upper nybble. This eliminates the LSR at the expense of ROM (though I pack the table by overlapping shapes so both tables now occupy 1 page).

 

  ldy #0
preploop
  lda (message),y
  sta queue_0,y        ; set queue to letter - even queues have char graphic in upper nybble, odd queues in lower nybble
  iny
  cpy #12
  bne preploop

  ldy #5
displayloop
  sta WSYNC
  SLEEP 20             ; (not sure actual value)
  lda queue_or_0_1     ; returns result of queue 0 ORed with queue 1
  sta GRP0
  lda queue_or_2_3
  sta GRP1
  lda queue_or_4_5
  sta GRP0
  lda queue_or_6_7
  sta GRP1
  lda queue_or_8_9
  sta GRP0
  lda queue_or_10_11
  sta GRP1
  dey
  bpl displayloop

Edited by SpiceWare

We're already planning on offering complete ARM-based text rendering on Chimera. We need that in order to have a nice low-overhead menuing application. Thomas supplied me with the necessary modification to the Stellar Track kernel to allow for true 8x8 characters (nicely centered by the way) or with automatic offsetting for lowercase descenders.

 

By not wasting time rendering the next row while between rows you can do a full seamless character mode this way.

 

We're planning to use the 128-character Atari 8-bit font, loaded into RAM, so you can use redefined character sets. We may not even impose a cell boundary. We might just let you plot text to any arbitrary X-Y. You can try making games with just character graphics this way although overlapping animation would be tricky.

 

The nice thing about rendering engines like this is the ARM has no idea what kind of kernel you're going to use with it. All it assumes is the memory-to-screen-coordinate relationship (i.e. a series of columns of graphics).

 

So you could use the same engine with a Stellar Track type kernel, or a narrower 6-char, or Suicide Mission. (Playfield would require some transformation at some stage).

 

I can think of a couple different ways you could do 4-bit using this system. Either we support that natively or we allow a kind of "print thru" mode where you can overlay two characters into one cell. Then you just use a special redefined character set.

Edited by mos6507

I would suggest that it might be useful to support AND/OR read/modify/write transformations, at least on the internal queues. This would greatly facilitate both 4-pixel text and high-resolution graphics. In the case of 4-pixel text, set the queues up for straight writing, then draw the left characters in every column, then switch to "or" mode and draw the right character in every column. A very simple copy loop in either case.
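A minimal Python model of the suggested write modes, purely as an illustration: the class, the mode names, and the flat byte buffer are assumptions, and the right-hand character data is assumed to arrive already in the low nybble (from a second table or a shift).

```python
class QueueBuffer:
    """Hypothetical queue memory with selectable read/modify/write modes."""

    MODES = ("store", "or", "and")

    def __init__(self, size):
        self.data = [0] * size
        self.mode = "store"

    def set_mode(self, mode):
        if mode not in self.MODES:
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode

    def write(self, index, value):
        value &= 0xFF
        if self.mode == "or":
            self.data[index] |= value
        elif self.mode == "and":
            self.data[index] &= value
        else:
            self.data[index] = value


if __name__ == "__main__":
    column = QueueBuffer(5)
    left = [0x40, 0xA0, 0xE0, 0xA0, 0xA0]    # left character, upper nybble
    right = [0x0C, 0x0A, 0x0C, 0x0A, 0x0C]   # right character, lower nybble
    for line, value in enumerate(left):      # pass 1: straight writes
        column.write(line, value)
    column.set_mode("or")
    for line, value in enumerate(right):     # pass 2: OR in the right character
        column.write(line, value)
    print([f"${byte:02X}" for byte in column.data])
```

Both passes are the same simple copy loop; only the mode changes between them.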

 

I think my personal preferred style of kernel would probably be a 12-column version of the Ruby Runner kernel, if queues would allow for such a thing being accomplished. The Ruby Runner kernel at present only supports 10 columns (80 pixels) but using flicker-blinds is able to show four colors plus black. If used in a suitable queueing architecture, it would be possible to draw something like an 80x160 bitmap with a 20x20 colormap. A game like Robotron could run very nicely under such a scenario. It is a little irksome only having 10 columns instead of 12, but perhaps queueing would save enough cycles to make 12 work. Note, btw, that it would probably be necessary to use an HMOVE trick that is at present unsupported by emulators.


This thread has convinced me to bump the ARM up to the 236x series. That means 32K of RAM and at least 128K flash for built-in library functions. So we shouldn't have any more technical problems. We will have room for more specialized functions. I'm just crying over the chip being so much more expensive than the 2103, and I'm crossing my fingers that either a 2103-like chip with more RAM comes out or there is a price-drop on 236x's before we start selling final boards.

 

We're now about $20 in raw parts costs, not taking into account the case and the packaging. I was hoping we could sell the standalones for $30 and the embeddeds for half that. So if I stick to my guns that means I take a big loss in order to initially subsidize it. But I realized the 2103 would never ever pay off on its raw clock speed with such little memory.


If there was a way to OR two queues together with a single read. The routine I used has the char-shapes in 2 pages, one page having the characters in the lower nybble, the other holding them in the upper nibble. This eliminates the LSR at the expense of ROM

That's how I'm doing the 24-column text in "E.T. Book Cart"-- 12 players (with flicker), each having two 4-pixel-wide characters, with the character set defined twice in ROM-- one set for the hi nybble or left character, and the other set for the lo nybble or right character, so I can ORA the two values together without needing to LSR anything.

 

However, it's taking me about 8 scan lines to preload a row of text, due to a few considerations:

 

(1) The two character sets are stored in one bank (along with the kernel), and the text data is stored in several other banks, so I can't load the text data and the character data at the same time. What I'm doing is storing a short subroutine in zero-page RAM, which I call to switch to the appropriate bank for the text, then read a row of text and store it in zero-page RAM, then switch back to the bank with the kernel and character sets to load the appropriate character shape data into RAM.

 

(2) I'm using four special control characters to help reduce the amount of ROM needed to store each screenful of text-- horizontal tab, line feed, vertical tab, and form feed. So the routine that reads a row of text and stores it into RAM needs to check for those special characters, and act accordingly when they're found, which increases the time needed to copy a row of ROM text into RAM. I believe it's taking almost 3 scan lines to copy a row of text into RAM.

 

(3) The process of fetching and storing the character shape data is taking over 4 scan lines, and would take even longer, but I'm storing only the last six characters (or last three player shapes) this way. The first six characters (first three player shapes) are being fetched within each display line, then I get the other three player shapes from RAM for the rest of the line.

 

(4) To keep the timing consistent, I have to use a STA WSYNC at just the right spot in the routine that fetches the character shape data.

 

(5) I'm using JSR and RTS, along with a few JMPs, so that's also adding to the preload time.

 

Taken all together, the preloading requires almost 8 scan lines. And the character shapes are also 8 scan lines tall, so that works out rather nicely (i.e., it's as if the rows of text are double-spaced).

 

If I could have used one queue to grab the text data, and another queue to grab the player data, it would have been a heck of a lot nicer! :)

 

Michael


Queues can be used as simply another addressing mode. You don't have to restrict yourself to using them in the kernel to directly feed the TIA. For instance, you could use them to gain read/write access to other banks without bankswitching. I do think for text rendering you would be better off getting the ARM to do the work during VBLANK. I'd have to look at the kernel timings but I think you could also render the text using just the Supercharger. You need more RAM than just Superchip. I can just say that I dislike the large gaps between rows that you get with Stellar Track due to the JIT rendering. If you want to present the most information on one page you have to use more RAM.
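Here is a toy Python model of the "queue as addressing mode" idea. The sizes (128K of SRAM, 2K banks) are the figures used elsewhere in this thread, and the classes and method names are hypothetical. The only point is that a queue carries its own pointer into the whole SRAM, so reading through it does not depend on, or disturb, the currently selected bank.

```python
SRAM_SIZE = 128 * 1024
BANK_SIZE = 2 * 1024


class Cart:
    """Hypothetical cart: flat SRAM plus one bank window seen by the 6507."""

    def __init__(self):
        self.sram = bytearray(SRAM_SIZE)
        self.bank = 0

    def read_banked(self, offset):
        """Ordinary read through the currently selected bank."""
        return self.sram[self.bank * BANK_SIZE + (offset % BANK_SIZE)]


class ReadQueue:
    """Hypothetical queue with its own auto-incrementing SRAM pointer."""

    def __init__(self, cart, start):
        self.cart = cart
        self.pointer = start

    def read(self):
        value = self.cart.sram[self.pointer % SRAM_SIZE]
        self.pointer += 1
        return value


if __name__ == "__main__":
    cart = Cart()
    cart.sram[0x1F000:0x1F003] = b"\x01\x02\x03"   # text stored far from bank 0
    queue = ReadQueue(cart, 0x1F000)
    print([queue.read() for _ in range(3)], "bank still", cart.bank)
```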


I can just say that I dislike the large gaps between rows that you get with Stellar Track due to the JIT rendering. If you want to present the most information on one page you have to use more RAM.

True, but the "E.T. Book Cart" uses no extra RAM, just the standard 128 bytes of zero-page RAM. I look forward to being able to read from other banks without doing any switching. :lust:

 

Michael


Just remembered today about something I wanted a long time ago: math functions.

 

Is there a way to use queues to perform math functions like square root, x^y, factorial, multiply, divide, etc.?

 

Maybe have a "math" queue that you write values to, with a hotspot to specify the function, which is applied when you read from the queue? Maybe two math queues for math functions with two numbers (multiply, divide, x^y,...)? And what about trig functions?
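To make the idea concrete, here is one way such a math queue could behave, modelled in Python. This is only a guess at an interface: the function names, the 8-bit operands, and the 16-bit little-endian result are all assumptions, and integer square root stands in for the fancier functions.

```python
import math


class MathQueue:
    """Hypothetical math queue: write operands, pick a function, read result."""

    # name -> (number of operands, function)
    FUNCTIONS = {
        "multiply": (2, lambda a, b: a * b),
        "divide": (2, lambda a, b: a // b),
        "power": (2, lambda a, b: a ** b),
        "sqrt": (1, math.isqrt),
    }

    def __init__(self):
        self._operands = []
        self._result = []

    def write(self, value):
        """Write one 8-bit operand to the queue."""
        self._operands.append(value & 0xFF)

    def select(self, name):
        """Hotspot access: apply the named function to the latest operands."""
        arity, function = self.FUNCTIONS[name]
        if len(self._operands) < arity:
            raise ValueError("not enough operands written")
        args = self._operands[-arity:]
        self._operands.clear()
        value = function(*args) & 0xFFFF            # result truncated to 16 bits
        self._result = [value & 0xFF, value >> 8]   # low byte first

    def read(self):
        """Read the next result byte; returns 0 once the result is used up."""
        return self._result.pop(0) if self._result else 0


if __name__ == "__main__":
    queue = MathQueue()
    queue.write(200)
    queue.write(3)
    queue.select("multiply")
    low, high = queue.read(), queue.read()
    print(low + 256 * high)
```

On the 6507 side this would amount to two stores, one hotspot access, and two loads.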

 

For complex games (like ones in 3D space) this would really come in handy, for example in calculating distances or transforming vectors.


Yes, that is high on the priority list. We just have to settle on an efficient message-passing scheme.

 

Aside from working with native VCS user input and TIA registers, it will be up to the programmer's choice and available ARM RAM as to how much of the actual game executes on the 6507 vs. the ARM end. There is going to be a lot more room for code on the VCS end but if we have a complete set of libraries on the ARM end then most of the VBLANK processing would just be calling those, either from the VCS or ARM end. There are advantages and disadvantages to both.

 

With the ARM, there is limited RAM space. With the 16-bit instructions, even with 32K it constrains how much custom stuff you can run from RAM. SRAM access would be slow but still faster than the 6507, and it would be able to work with all 128K at once regardless of the current banks.

 

With the VCS you would have a lot of available code space and it will be easier for people who already know 6502 asm but everything would run slower and you have to bank in and out 2K chunks.

