
blitter access to line buffer



Pixel mode in the blitter is not some obscure technique only for texture mapping. I mean, it is also needed for rotation, of course. Collision detection works in pixel mode only. Gouraud shading and the Z-buffer also work in pixel mode and don't stress the CPU as much, so they pay off for short lines and small triangles (high-LoD tessellation). I think pixel mode is also needed to expand pixels to more bits even in a simple blit; one can use SRCSHADE to fill the other bits. Pixel mode works with the 16-bit CLUT, as seen in the SDK. There is no conflict with the OP because of the bus lock, or maybe there would be one or two cycles after the switch. But anyway, the OP needs 4/3 cycles per pixel at 1:1 scale and 1 cycle per pixel when scaling; the latter only uses one of the CLUT address buses, so the other is free for us. The line buffer has a 16-bit interface to the bus as well and thus works well in pixel mode. The OP runs through the vertical blank and thus can be used to load textures. There is a register to specify buffer swaps at certain horizontal positions. That is a bit nasty, but for large textures used all over a frame maybe we keep the texture across multiple lines anyway.

I hate sorting because it costs a lot of CPU / GPU / Jerry time (Doom uses Jerry for that). An advantage of the Z-buffer is that we do not need to sort: just render in whatever order the game logic supplies the data. Personally, I am only interested in dungeon games (Descent), so I need to carefully select which "game objects" to touch anyway. But for other games? Racing games and almost all first-generation 3D console games have theatrical, stage-like scenes. Almost every front-facing polygon has some pixels visible, so back-to-front rendering is a good fit. Now, the Jag is so fill-rate limited that the fill rate is effectively halved when you hide behind a corner in Skyhammer, or when you would use the blitter to draw those bridges or canyons in OutRun. I found one account of someone talking about span sorting. That person was probably aiming for very high poly counts, but the realistic number of polygons is what you see in BattleSphere or in the tunnels of pseudo-2D racing games (and Checkered Flag without scenery, or that F1 game). Since we are fill-rate limited, we mostly care about larger polygons. We render spans front to back into a buffer with fixed capacity. If spans fuse on the way, great! If the buffer is full, accept overdraw. (A sketch of such a span buffer is below.)

The pipelined architecture of JRISC likes parallel processing. For the buffer there would be an unrolled loop with interleaved branches: two CMPs, then a branch, some store, another branch. Proper mixing reduces wait states in the GPU. As the "GPU memory as texture cache" code says: GPU SRAM is indeed 32 bit, so two instructions are fetched at a time. Just align your labels and keep LOAD / STORE away from them! An unrolled loop also allows doing a bubble sort within the registers alone. Looking at the number of registers and the small amount of SRAM, sorting up to 32 values that way is very fast; anything above that needs some tree approach (less robust). With RLE sprites or polygons, only a few edges should cross as we advance to the next scanline, so the bubble sort needs to count / detect the number of swaps and stop early (second sketch below). But we also have to do other stuff in between those sorts, so I guess a purely register-based sort is not as useful as I first thought. Maybe have a first pass which also loads, then stay register-based until sorted, and then store.
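To make the span buffer idea concrete, here is a minimal sketch in C (the real thing would be JRISC; the names Span, SpanBuffer, span_insert, draw_span and the capacity of 32 are my own assumptions, not anything from the SDK). Spans arrive in front-to-back order; only the still-uncovered parts get drawn, touching coverage intervals fuse, and once the list is full we just draw and accept overdraw:

```c
#include <stdint.h>

#define MAX_SPANS 32            /* assumed capacity, matching the "about 32 values" above */

typedef struct { int16_t x0, x1; } Span;                 /* covered interval [x0, x1) in pixels */
typedef struct { Span s[MAX_SPANS]; int n; } SpanBuffer; /* sorted, disjoint coverage per scanline */

/* hypothetical hook: whatever actually sets up the blitter for one visible span */
extern void draw_span(int16_t x0, int16_t x1, const void *poly);

/* Insert one polygon span, front to back: draw only the parts not yet covered,
 * then merge the new coverage into the list, fusing with its neighbours.
 * If the list is full, the coverage is simply not recorded, so later spans
 * behind this one get overdrawn: the "accept overdraw" fallback. */
static void span_insert(SpanBuffer *b, int16_t x0, int16_t x1, const void *poly)
{
    if (x0 >= x1) return;

    /* 1. walk the existing coverage, drawing the gaps the new span fills */
    int16_t cur = x0;
    for (int i = 0; i < b->n && cur < x1; i++) {
        if (b->s[i].x1 <= cur)  continue;              /* coverage entirely to our left  */
        if (b->s[i].x0 >= x1)   break;                 /* coverage entirely to our right */
        if (cur < b->s[i].x0)   draw_span(cur, b->s[i].x0, poly);
        if (b->s[i].x1 > cur)   cur = b->s[i].x1;      /* skip the occluded part */
    }
    if (cur < x1) draw_span(cur, x1, poly);

    /* 2. merge [x0, x1) into the coverage list, fusing overlapping/touching spans */
    int lo = 0;
    while (lo < b->n && b->s[lo].x1 < x0) lo++;
    int hi = lo;
    int16_t nx0 = x0, nx1 = x1;
    while (hi < b->n && b->s[hi].x0 <= x1) {
        if (b->s[hi].x0 < nx0) nx0 = b->s[hi].x0;
        if (b->s[hi].x1 > nx1) nx1 = b->s[hi].x1;
        hi++;
    }
    int removed = hi - lo;
    if (removed == 0) {
        if (b->n == MAX_SPANS) return;                        /* full: accept overdraw */
        for (int j = b->n; j > lo; j--) b->s[j] = b->s[j-1];  /* make room for a new entry */
        b->n++;
    } else if (removed > 1) {
        for (int j = hi; j < b->n; j++) b->s[lo+1+(j-hi)] = b->s[j];  /* close the gap */
        b->n -= removed - 1;
    }
    b->s[lo].x0 = nx0;                                        /* store the fused span */
    b->s[lo].x1 = nx1;
}
```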

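And the bubble sort with a swap count, again only as a plain C sketch of the idea (on the GPU it would be unrolled and kept entirely in registers; edge_x and resort_edges are made-up names). Because edges only drift a little per scanline, the array is almost sorted and the loop usually exits after one or two passes:

```c
#include <stdint.h>

/* Re-sort the active edge x positions after stepping one scanline. */
static void resort_edges(int16_t *edge_x, int n)
{
    int swaps;
    do {
        swaps = 0;
        for (int i = 0; i + 1 < n; i++) {
            if (edge_x[i] > edge_x[i + 1]) {          /* neighbours out of order: swap */
                int16_t t     = edge_x[i];
                edge_x[i]     = edge_x[i + 1];
                edge_x[i + 1] = t;
                swaps++;
            }
        }
    } while (swaps != 0);                             /* stop once a pass makes no changes */
}
```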
 

This would also work in Super Burn Out (to increase horizontal resolution?). We could use the spans from the large OBJECTs as occluders and also tell the OP not to load invisible phrases. Pixel mode is not really a thing with the OP, and there is a big cost per sprite per scanline anyway, which prompts everyone to use the blitter to draw the smaller sprites into the line buffer. Even for unscaled sprites the blitter is two times slower. For opaque objects the Z-buffer could be used as a second phrase. But 16-pixel-wide spans would need an extra code path. I dunno.

 

So in the end: the 5 cycles per pixel on the blitter is a given. No one has recorded anything faster. It is just sad that the memory is idle most of the time. If the blitter were working all the time, we could fill every pixel in the windshield of, for example, TestDrive (Amiga, PC) (320x100 px). It is what it is. But if we decide to like the Jaguar, we have decided to like that most of the memory bandwidth is wasted. Got that out of the way. Of course you get conflicts with the GPU if you try to do it in SRAM, because the GPU continues to run even if it does not own the bus. Software rendering on the GPU always comes out a bit slower than those 5 cycles. So we are stuck with slow and slightly slower. Benchmarking will show two applications: for scaling down, you better collect a phrase of pixels and then blit to DRAM (for lower than 60 fps or full scanlines; see the sketch below). For scaling up / tiling, you better cache the original texture. The latter would prompt me to put those draw calls into the vertical border. So I cannot sort by Z? It applies to scaled-up spans only, so the part until the span buffer overflows. So many code paths to benchmark ..
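For the scaled-down case, the "collect a phrase, then write" loop could look roughly like this in C. The 16-bit pixels and the 64-bit phrase are how the Jaguar works; gather_texel, the pixel order inside the phrase, and the rest are assumptions for illustration:

```c
#include <stdint.h>

/* stand-in for whatever per-pixel texture fetch the inner loop really does (hypothetical) */
extern uint16_t gather_texel(int32_t u, int32_t v);

/* Walk a span with a fixed texture step, pack four 16-bit pixels into one
 * 64-bit phrase in local memory, and only then write the phrase out, so the
 * destination sees phrase-wide writes instead of single 16-bit ones. */
static void draw_span_scaled_down(uint64_t *dst_phrases, int pixels,
                                  int32_t u, int32_t v,
                                  int32_t du, int32_t dv)
{
    uint64_t phrase = 0;
    int filled = 0;
    for (int x = 0; x < pixels; x++) {
        phrase |= (uint64_t)gather_texel(u, v) << (16 * filled);  /* assumed pixel order */
        u += du;  v += dv;
        if (++filled == 4) {                 /* phrase complete: flush it */
            *dst_phrases++ = phrase;
            phrase = 0;
            filled = 0;
        }
    }
    if (filled)                              /* partial phrase at the span end */
        *dst_phrases = phrase;               /* (real code would mask/merge here) */
}
```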

 

Anyway, beyond flat-shaded games that look like they belong on a 16-bit console, Wing Commander, Out Run, and Fight4Life are possible. Fight4Life just has bad game mechanics. And it needs a simple floor .. I want all the polygons on the fighters. Sorry, you cannot have a nice background on the Jag.

