
Would it be possible to "bus stuff" NES PPU and CPU, DPC+ style?



First off, let me confess that this is not only in the wrong subforum but possibly the wrong forum. Maybe it would be better served on NESdev. Regardless, there is no "NES programming" subforum on AtariAge, so feel free to move this.

 

Anyway, the main hangup for the Atari 2600 regarding bus stuffing is that the cart bus cannot directly drive the TIA without first passing commands to the 6507 to write the TIA registers. Still, the fact that homebrew carts for the 2600 exist with an ARM dozens of times more powerful than the main CPU is impressive. But you guys know this. Melody DPC+ homebrews exist and are awesome.

 

Then I started thinking about some of the new homebrew mappers for the NES, my favorite system (sorry Atari), which support 4-way mirroring and on-cart flash storage for game saves. These use rewritable flash ROM for PRG and typically CHR RAM. I wondered if it were somehow possible to "bus stuff" the NES the same way DPC+ Melody boards bus stuff the Atari, then got to thinking about how the PRG and CHR buses on the NES work, and I had an epiphany.

 

The NES is the only system I know of with direct bus access to tile graphics ROM on the cart side. Nearly every other system copies graphic tiles to program or video RAM or otherwise loads the assets off the cart and into the system. This is why the NES can produce such impressive graphics with only 2 + 2 KB of RAM (video plus program): 8 KB of graphics data are directly accessible via the CHR bus, and more is instantly available with zero load time via bank switching.

 

This works by sending a lookup address to the ROM or RAM bank, and the chip sends back 8 bits of data as one byte. A lot of impressive tricks can be performed by various first-party, third-party, and homebrew mappers, but virtually all of them have a basic 8-bit ROM or RAM connected to the bus. The PPU continuously streams 2-bit-per-pixel tile data from CHR ROM or RAM, and every pixel onscreen is assigned a color by using that 2-bit value as an index into a color map inside the PPU. Without the CHR bus, the NES cannot display tile graphics.

 

So instead of relying on the PPU to send address data to the CHR ROM and fetch the 8-bit value of said address, what if the address values requested by the PPU were basically ignored? If the PPU pixel clocks could somehow be counted by the cart mapper, it would be possible to "race the beam" by streaming a series of 8-bit values for every read of the CHR "ROM". This essentially would allow the cart to stream full motion video through the PPU, although the color index would be severely limited by the NES hardware. Essentially, you get a selection of 4-color palettes (3 colors plus background) that can be assigned to blocks onscreen. Additional swatches of preset color palettes could be overlaid on the FMV screen using hardware sprites.

 

Regardless, some fairly elaborate visual performances could be displayed. A coprocessor inside the cartridge could store video info into a frame buffer and stream it to the NES PPU as it races the beam in real time. The address ranges sent from the PPU to the virtual ROM chip can be used for timing or clock counts for the bus stuffing engine. Anyway, the FMV engine runs continuously on the cart hardware, and a completely different engine not based on NES architecture draws the graphics. It could also send all sorts of commands to the NES CPU through the PRG bus rather than actual ROM data, similar to how DPC+ works for Atari.
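
To make the idea concrete, here is a toy Python model of a cart that ignores the address the PPU requests and returns the next byte of a prepared stream. It is only a sketch of the concept: the class name `StreamingChr` and its methods are hypothetical, it does not describe any existing mapper, and it ignores the real PPU fetch order and timing.

```python
class StreamingChr:
    """Toy model: a CHR 'ROM' that ignores the address and streams bytes in order."""

    def __init__(self, stream):
        self.stream = bytes(stream)
        self.pos = 0  # index of the next byte to hand to the PPU

    def read(self, ppu_addr):
        # The address is deliberately unused: every CHR read gets the next byte.
        value = self.stream[self.pos % len(self.stream)]
        self.pos += 1
        return value

    def reset(self):
        # A real cart would need some sync point (for example once per frame);
        # how it would detect that point is left open here.
        self.pos = 0


chr_bus = StreamingChr([0x3C, 0x7E, 0xFF])
# Two unrelated addresses, yet the bytes come back in stream order.
first = chr_bus.read(0x0000)
second = chr_bus.read(0x1FF8)
```

In this model the requested address only serves as a "clock tick"; a real design would presumably also watch the address pattern to stay in sync with the frame.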

 

A new game engine could be developed to harness the potential of DPC+ PPU/CPU bus stuffing on the NES, doing all sorts of things that would be impossible with a traditional ROM + mapper game.


DPC+ does not use bus stuffing. We added a "fast fetch" feature which makes it slightly faster than DPC, but not nearly as fast as bus stuffing. Examples:

 

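DoDraw, which runs on a stock cartridge, will update and color a sprite in 26 cycles:

   lda #SPRITEHEIGHT ; 2
   dcp SpriteTemp    ; 5   undocumented opcode: DEC memory, then CMP with A
   bcs DoDraw        ; 2/3 taken while inside the sprite
   lda #0            ; 2   outside the sprite: draw blank graphics
   .byte $2C         ; 4   BIT abs opcode, skips the following 2-byte lda
DoDraw
   lda (GfxPtr),Y    ; 5
   sta GRP0          ; 3   +18 cycles on either path
   lda (ColorPtr),Y  ; 5
   sta COLUP0        ; 3   26 cycles total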

Activision's DPC coprocessor (used in Pitfall II) gets that down to 14 cycles:

   LDA DF0DATAW ; 4
   STA GRP0     ; 3
   LDA DF1DATA  ; 4
   STA COLUP0   ; 3

Our enhanced DPC+ uses "fast fetch" to get that down to 10 cycles:

   LDA #<DF0DATAW ; 2
   STA GRP0       ; 3
   LDA #<DF1DATA  ; 2
   STA COLUP0     ; 3

While Bus Stuffing eliminates the load instructions to get that down to 6 cycles:

   STY GRP0     ; 3
   STY COLUP0   ; 3

 

While we've done tests to confirm it works, we've not actually created a bus-stuffing bankswitch format yet. Once we do, we'll be able to create even better graphics than were possible using DPC+. There are only 76 cycles per scanline, so the faster you can update TIA registers, the more of them you can update on each scanline.
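
As a rough worked example of that last point, the Python sketch below divides the 76-cycle scanline budget by the per-sprite costs from the four listings above. The numbers are upper bounds only: the function name is made up for illustration, and a real kernel also spends cycles on loop control and other registers.

```python
CYCLES_PER_SCANLINE = 76  # 6507 cycles per scanline, as stated above

# Cycles to update one sprite's graphics and color, taken from the listings above.
COST = {
    "stock (DoDraw)": 26,
    "DPC": 14,
    "DPC+ fast fetch": 10,
    "bus stuffing": 6,
}


def pairs_per_scanline(cycles_per_pair, budget=CYCLES_PER_SCANLINE):
    """Upper bound on GRPx/COLUPx pairs that fit in one scanline."""
    return budget // cycles_per_pair


for name, cost in COST.items():
    print(f"{name:16s} {cost:2d} cycles -> at most {pairs_per_scanline(cost)} pairs/line")
```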

 

I'm not familiar with the NES, never had one and no desire to get one, so have no insight as to whether or not it can do bus stuffing.


First off, let me confess that this is not only in the wrong subforum but possibly the wrong forum. Maybe it would be better served on NESdev. Regardless, there is no "NES programming" subforum on AtariAge, so feel free to move this.

It's the right forum if you plan to use this as a 2600 emulator cart for the NES.

 

Anyway, the main hangup for the Atari 2600 regarding bus stuffing is that the cart bus cannot directly drive the TIA without first passing commands to the 6507 to write the TIA registers. Still, the fact that homebrew carts for the 2600 exist with an ARM dozens of times more powerful than the main CPU is impressive. But you guys know this. Melody DPC+ homebrews exist and are awesome.

There is still plenty of graphical goodness left to be unleashed on the 2600.

 

So instead of relying on the PPU to send address data to the CHR ROM and fetch the 8-bit value of said address, what if the address values requested by the PPU were basically ignored? If the PPU pixel clocks could somehow be counted by the cart mapper, it would be possible to "race the beam" by streaming a series of 8-bit values for every read of the CHR "ROM". This essentially would allow the cart to stream full motion video through the PPU, although the color index would be severely limited by the NES hardware. Essentially, you get a selection of 4-color palettes (3 colors plus background) that can be assigned to blocks onscreen. Additional swatches of preset color palettes could be overlaid on the FMV screen using hardware sprites.

That would work just fine, but you'd still have the palette limitations to deal with. I'm not sure when and how often they are accessed though. Maybe you could bus-stuff the palette values to produce a more colorful scene.

 

Regardless, some fairly elaborate visual performances could be displayed. A coprocessor inside the cartridge could store video info into a frame buffer and stream it to the NES PPU as it races the beam in real time. The address ranges sent from the PPU to the virtual ROM chip can be used for timing or clock counts for the bus stuffing engine. Anyway, the FMV engine runs continuously on the cart hardware, and a completely different engine not based on NES architecture draws the graphics. It could also send all sorts of commands to the NES CPU through the PRG bus rather than actual ROM data, similar to how DPC+ works for Atari.

 

A new game engine could be developed to harness the potential of DPC+ PPU/CPU bus stuffing on the NES, doing all sorts of things that would be impossible with a traditional ROM + mapper game.

Sounds like you have some work to do :)

 

 


While Bus Stuffing eliminates the load instructions to get that down to 6 cycles:

   STY GRP0     ; 3
   STY COLUP0   ; 3

 

Make that 5 if you use address bus stuffing with a read-modify-write instruction.


 

 

True. supercat mentioned that would be possible by using DEC, though I didn't quite follow how that worked.

Bus-stuffing is fairly simple in theory. It's the actual implementation that gets tricky. The basic idea is to get the CPU to write a $ff value and then force some of the bits to 0 in order to write the real value that you want.
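
The basic idea can be modeled as a bitwise AND between what the CPU drives and what the cart allows through. The Python sketch below is only a logical model (helper names are made up for illustration) and says nothing about the electrical side, which is where the tricky part lives.

```python
def lines_pulled_low(wanted):
    """Bit mask of bus lines the cart must force to 0 so that `wanted` remains."""
    return ~wanted & 0xFF


def bus_value(cpu_value, pulled_low):
    """Value seen on the bus: the cart can clear bits but never set them."""
    return cpu_value & ~pulled_low & 0xFF


# The CPU writes $ff and the cart pulls the unwanted bits low.
seen = bus_value(0xFF, lines_pulled_low(0x3C))
```

Because the cart can only clear bits, starting from $ff is what makes every 8-bit value reachable.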

 

The 3 cycle bus-stuff can be done by doing a STA ZP instruction and overriding the data bus on the last cycle. Locations where the bus-stuffing occurs are marked with a *.

**Assumes A register is $ff
**D0 is the value $ff from A overridden by the bus-stuffing operation to the desired value

STA ZP Sequence without bus-stuffing:
Cycle:       |  0  |  1  |  2  |
Address Bus: |  PC | PC+1|  ZP |
Data Bus:    | $85 |  ZP |  A  |

STA ZP Sequence with bus-stuffing:
Cycle:       |  0  |  1  |  2  |
Address Bus: |  PC | PC+1|  ZP |
Data Bus:    | $85 |  ZP | *D0 |

The 5 cycle double write can be done by performing a ROL $ff instruction and overriding the address and data buses on each of the 2 write cycles. There are 2 write cycles because ROL $ff is a read-modify-write (RMW) instruction and the 6507 implementation of this instruction performs a dummy write while the rotate occurs. The first write is the original value from $00ff being written right back. The second write is the result of the rotate operation being written back. By setting $00ff to the value $ff ahead of time and having the Carry bit flag set, the rotate operation results in the same $ff value that it started with. This causes $ff to be on the data bus for both writes. By using the zero page location $ff the address bus is also $ff for both of the writes. It is very important that the value from the 6507 be $ff or 11111111b because it is only safe to force a 1 to a 0. Trying to force a 0 to a 1 would most likely destroy the 6507.

**Assumes $00ff = $ff and Carry flag set
**A0, A1, D0, and D1 are the value $ff overridden by the bus-stuffing operation to the desired value

ROL $ff Sequence without bus-stuffing:
Cycle:       |  0  |  1  |  2  |  3  |  4  |
Read/Write:  |  R  |  R  |  R  |  W  |  W  |
Address Bus: |  PC | PC+1| $ff | $ff | $ff |
Data Bus:    | $26 | $ff | $ff | $ff | $ff |

ROL $ff Sequence with bus-stuffing:
Cycle:       |  0  |  1  |  2  |  3  |  4  |
Read/Write:  |  R  |  R  |  R  |  W  |  W  |
Address Bus: |  PC | PC+1| $ff | *A0 | *A1 |
Data Bus:    | $26 | $ff | $ff | *D0 | *D1 |
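
A quick way to check the claim that both write cycles carry $ff is to model the rotate itself. This Python sketch assumes the standard 6502 ROL behavior (old bit 7 goes to carry, old carry goes to bit 0); the function names are made up for illustration.

```python
def rol(value, carry_in):
    """6502 ROL on one byte: returns (result, carry_out)."""
    result = ((value << 1) | carry_in) & 0xFF
    carry_out = (value >> 7) & 1
    return result, carry_out


def rmw_write_values(value, carry_in):
    """Data driven on the two write cycles of ROL: dummy write, then result."""
    result, _ = rol(value, carry_in)
    return value, result


dummy, final = rmw_write_values(0xFF, 1)    # both are 0xFF
without_carry = rmw_write_values(0xFF, 0)   # second write would be 0xFE
```

Note that the rotate also leaves the carry set, so in this model back-to-back ROL $ff instructions keep producing $ff without setting the carry again.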

I don't think the DEC instruction would work as well because the data bus would contain $fe on the last write cycle instead of $ff. This would cause D1 to be masked with $fe, thus preventing bit 0 from ever being set on the second write.
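
The masking effect can be counted directly. The Python sketch below (helper names made up for illustration) lists which bytes the cart could leave on the bus when the CPU drives $fe after a DEC, compared with $ff after a ROL with carry set.

```python
def dec(value):
    """6502 DEC on one byte (wraps at zero)."""
    return (value - 1) & 0xFF


def reachable_values(cpu_value):
    """All bytes obtainable by pulling some of the CPU's 1 bits down to 0."""
    return {v for v in range(256) if (v & cpu_value) == v}


second_write_dec = dec(0xFF)                  # value the CPU drives after DEC
dec_options = reachable_values(second_write_dec)
rol_options = reachable_values(0xFF)          # ROL $ff with carry set keeps $ff
```

With DEC only the 128 even values are available for the second write, while ROL leaves all 256.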


I checked what he wrote: he put "DEC" in quotes, and the 6507 code doesn't actually have a DEC instruction in it.

It's been quite a while, but I've written a byte-stuffing demo which limited itself to using a small set of instructions, IIRC something like

  NOP
  LDA #$A9  ; Use this when the ARM will be busy for more than one 6507 bus clock
  LDY #$FF
  JMP $1111
  STY $FF
  STY $FFFF
  ROL $FF
  ROL $FFFF

Rather than writing the display using 6507 assembly language, the display code on the ARM needed to continuously call subroutines which would cause the ARM to generate an instruction with particular desired operands.

Using this approach, I was able to have the ARM not only generate arbitrary stores on a per-cycle basis, but even generate two stores in five cycles using a "DEC" [the cycle following the ROL needed to have its address byte-stuffed to a TIA register whose upper bits would always read one; the two cycles after that could then be byte-stuffed with any desired address and data]. I was then able to write a kernel which used flicker blinds to output two groups of eight sprites, all of which were independently colored. Code without bus stuffing would need 80 cycles/line just for the GRPx/COLUPx stores alone, but bus stuffing cuts that in half (one COLUPx/GRPx pair per five cycles).
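
The 80 versus 40 figure can be reproduced with a little arithmetic. The Python sketch below assumes that a conventional store costs 5 cycles (for example a 2-cycle immediate load plus a 3-cycle store); that breakdown is an assumption made here, not something stated in the post.

```python
SPRITES_PER_LINE = 8
STORES_PER_SPRITE = 2        # one GRPx store and one COLUPx store
CYCLES_PER_SCANLINE = 76

# Assumed cost of a conventional load-and-store, e.g. LDA #imm (2) + STA zp (3).
LOAD_STORE_CYCLES = 5
# With the double write, one 5-cycle instruction performs both stores of a pair.
STUFFED_PAIR_CYCLES = 5

plain = SPRITES_PER_LINE * STORES_PER_SPRITE * LOAD_STORE_CYCLES
stuffed = SPRITES_PER_LINE * STUFFED_PAIR_CYCLES

print(f"without bus stuffing: {plain} cycles (budget {CYCLES_PER_SCANLINE})")
print(f"with bus stuffing:    {stuffed} cycles")
```

Under that assumption the conventional version does not even fit in a 76-cycle scanline, while the stuffed version leaves 36 cycles for everything else.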

 


After reading yours and re-reading supercat's, I realize I'd misread the bus-stuffing of the address. For some reason I was thinking you could only bus-stuff the data bus. Thanks for clearing that up!


I don't think the DEC instruction would work as well because the data bus would contain $fe on the last write cycle instead of $ff. This would cause D1 to be masked with $fe, thus preventing bit 0 from ever being set on the second write.

But the 2600 does not use bit 0 for its color registers :?


But the 2600 does not use bit 0 for its color registers :?

Surely, you want to update more than the colors as quickly as possible. The 6 writes necessary for an asymmetric playfield can be done in 15 = (6 * 5/2) cycles instead of 18 = (6 * 3). This is assuming the Atari can handle continuous bus-stuffing on both buses. Hopefully, I can test this out soon.

Surely, you want to update more than the colors as quickly as possible.

 

True, but supercat's experiment was to "output two groups of eight sprites, all of which were independently colored". He probably had an idea that might work and tried it out.

 

Many times I have run with an idea that worked well (event datastream), only to revisit it later and make it even better (jump datastream).


 

True, but supercat's experiment was to "output two groups of eight sprites, all of which were independently colored". He probably had an idea that might work and tried it out.

 

Many times I have run with an idea that worked well (event datastream), only to revisit it later and make it even better (jump datastream).

Now I'm confused. I thought the conclusion was that supercat's experiment also used the ROL instruction and that the mention of "DEC" was unrelated.

 

Either way, I wasn't trying to imply that using a DEC instruction is wrong or a bad idea. Just pointing out that ROL would be more useful in a full blown implementation.
