
Blitting into GPU RAM



Is it possible to blit into GPU RAM while the GPU is running? Am I missing some magic?

 

I have some test code which copies some data to screen memory. I can see it copying; it's a small buffer, there's no overflow, and it's working fine.

 

If I change the dest address to be within the GPU RAM it just dies. 🤷‍♂️

 

I've seen quite a few comments about blitting into GPU RAM, and there is even the write-only 32-bit space at G_RAM+$8000 specifically for this purpose. It seems crazy this isn't working...


A bit more poking -- if I set the destination address to the object processor line buffer, that works, or anywhere in main RAM it works. Even Jerry RAM works, although it's quite slow. I'm starting to think you may only be able to blit into GPU RAM when the GPU is halted? Which would be a bit crap.

 

And just for reference, this is using 32 bits per pixel, in pixel mode, with A2 as the destination so A1 can scan arbitrarily across some data. All buffers are phrase-aligned.


In GemRace I use a temporary buffer in GPU RAM. The GPU sets up the blitter and waits for the blit to complete. So blitting into GPU RAM with the GPU running is possible.

I'm using 16 bits per pixel, destination in pixel mode, source in INC mode. A2 is the destination and A1 the source. Buffers are long-aligned.

I can send you the whole GPU source if you want?

 


That’s good to know! In one test I set the buffer sometimes to GPU RAM and sometimes to DRAM based on the value of a counter, and that didn’t hang. However, it was not very GPU heavy; mostly it was going to DRAM. So there must be some odd timing / contention related issue? If I have a tight loop that calculates the next span location, starts the blitter and immediately waits for completion, it hangs pretty quickly (within a couple of seconds), but it runs fine forever when writing to DRAM.

 

I’d certainly be interested to see any GPU code you have with blitter use like this. It may be that if I just add a large delay (which in practice will be the time spent processing the buffer) after the blit wait has completed, it will be happier. I’m just trying to get a proof of concept of getting the data where I need it before proceeding!

 

The idea here is that, rather than using the GPU to run the DDA that fetches multiple bytes at an arbitrary angle from a byte map, the blitter does this and puts the result in a linear buffer, which is far easier for the GPU to process and lets the two work in parallel.
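To make the DDA gather concrete, here is a minimal Python model of the logic (not Jaguar code, and not the poster's actual routine). It assumes 16.16 fixed point and a square, power-of-two byte map; the names `to_fixed` and `dda_gather` are made up for this sketch.

```python
FRAC_BITS = 16  # 16.16 fixed point, an assumption of this sketch


def to_fixed(value):
    """Convert a number to 16.16 fixed point."""
    return int(round(value * (1 << FRAC_BITS)))


def dda_gather(byte_map, map_size, x0, y0, dx, dy, count):
    """Step along a line through the map and copy each sample into a linear buffer.

    x0, y0, dx, dy are 16.16 fixed point. map_size must be a power of two so
    that coordinates can wrap with a simple mask.
    """
    mask = map_size - 1
    out = bytearray(count)
    x, y = x0, y0
    for i in range(count):
        xi = (x >> FRAC_BITS) & mask   # integer part of X, wrapped
        yi = (y >> FRAC_BITS) & mask   # integer part of Y, wrapped
        out[i] = byte_map[yi * map_size + xi]
        x += dx                        # the DDA step: add the per-sample increments
        y += dy
    return out


# 8x8 demo map where each byte encodes its own position as (y << 4) | x
SIZE = 8
demo_map = bytes((y << 4) | x for y in range(SIZE) for x in range(SIZE))
span = dda_gather(demo_map, SIZE, to_fixed(0), to_fixed(0),
                  to_fixed(1.0), to_fixed(0.5), 6)
print([hex(b) for b in span])
```

The consumer then only has to walk `span` linearly, which is the point of offloading the traversal.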


3 hours ago, SainT said:

Is it possible to blit into GPU RAM while the GPU is running? Am I missing some magic?

I remember Gorf mentioning this once. Supposedly you can run GPU code in a low/high area of memory while loading GPU code into the unused portion.


Check the Doom code. Chilly Willy claimed the gpu loaded itself. It had to have something running if so.

 

Quote

Given how efficient loading the GPU local ram is, having GPU code load itself in stages could have easily been used by more games than just Doom. It probably would have become standard if the Jag had lasted in the marketplace.

 


2 hours ago, JagChris said:

Check the Doom code. Chilly Willy claimed the gpu loaded itself. It had to have something running if so.

 

 

 

No, it's a byte copy loop.  Chris, this is so far above your paygrade, please don't shit up yet another thread with your crap.  Go back to reading kindergarten books or something. This is the programming section, after 20+ years you have zero to contribute in here, and misinformation is not helpful.


13 hours ago, SainT said:

Is it possible to blit into GPU RAM while the GPU is running? Am I missing some magic?

 


JagTris runs from the GPU only and loads overlays via the blitter when needed. So yes, it is possible.

BJL also contains macros to handle this.

Be sure to use the +$8000 address for 32-bit bus access.


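;-----------------------------------------
;- Copy overlay routine
;- in: r0 = destination, r1 = source, r2 = count
;- (register names in the comments assume blitter points at $F02200)
;-----------------------------------------
overlay:
	load (blitter+$38),r3	; read blitter status
	shrq #1,r3	; idle bit -> carry
	jr cc,overlay	; loop while the blitter is still busy
	nop

	store r0,(blitter)	; A1 base = destination
	store r1,(blitter+$24)	; A2 base = source
	movei #BLIT_PITCH1|BLIT_PIXEL8|BLIT_WID320|BLIT_XADDPHR,r0	; 8-bit pixels, phrase mode
	xor r1,r1	; r1 = 0
	store r0,(blitter+4)	; A1 flags
	store r0,(blitter+$28)	; A2 flags (same as A1)
	store r1,(blitter+$c)	; A1 pixel pointer = 0
	store r1,(blitter+$18)	; A1 fractional pixel pointer = 0
	store r1,(blitter+$30)	; A2 pixel pointer = 0

	movei #BLIT_SRCEN|BLIT_LFU_REPLACE|BLIT_BUSHI*0,r1	; plain source copy, BUSHI left off
	store r2,(blitter+$3c)	; count
	store r1,(blitter+$38)	; command: starts the blit
	WAITBLITTER
	jump	(LR)
	nop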

 

and to load an overlay:

 

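	movei #MODrun_\0+$8000,r0	; dest-adr
	movei #MODstart_\0,r1	; source address
	movei #1<<16|(MODlen_\0),r2	; count: 1 outer pass, MODlen_ inner
	movei #overlay,r3
	BL (r3)	; call the overlay routine, which returns via LR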

 

Where MODrun_ is the destination (run address), MODstart_ is the source and MODlen_ is the length ;-)

WAITBLITTER does the same as the wait loop at the entry of the routine, but uses R0!
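As a quick sanity check of the two values passed in r0 and r2, here is a small Python model (not Jaguar code). It assumes GPU local RAM is the 4 KB block starting at $F03000, and the helper names `blit_dest_for` and `pack_count` are invented for this sketch.

```python
G_RAM = 0xF03000         # start of GPU local RAM
G_RAM_SIZE = 0x1000      # 4 KB
WRITE32_OFFSET = 0x8000  # the "+$8000" 32-bit write window mentioned in the thread


def blit_dest_for(run_addr):
    """Return the blitter destination for code that will later run at run_addr."""
    if not (G_RAM <= run_addr < G_RAM + G_RAM_SIZE):
        raise ValueError("run address is outside GPU RAM")
    return run_addr + WRITE32_OFFSET


def pack_count(outer, inner):
    """Pack the loop counts the way 1<<16|(MODlen_) does: outer high, inner low."""
    if not (0 < outer <= 0xFFFF and 0 < inner <= 0xFFFF):
        raise ValueError("each count must fit in 16 bits and be non-zero")
    return (outer << 16) | inner


print(hex(blit_dest_for(0xF03100)))  # destination handed to the blitter
print(hex(pack_count(1, 0x200)))     # one pass over 0x200 pixels
```

The GPU still executes the overlay from its normal run address; only the blitter writes through the offset window.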

Edited by 42bs

Well, I got it working, but it is a bit unstable. I think the instability comes from the 68000 and blitter both writing to GPU RAM, so the hanging I was seeing was probably down to the GPU getting bad parameters passed from the 68000. More importantly, it is worth around an additional 4 fps, going from 18 to 22, so it's a good improvement.


Why is the 68k writing to GPU RAM? You should avoid this in any case, at least when the GPU runs.

If you want to pass parameters use a DRAM section.

But the 68k has a lesser prio than the Blitter, so there should not be any disturbance.


2 minutes ago, 42bs said:

Why is the 68k writing to GPU RAM? You should avoid this in any case, at least when the GPU runs.

If you want to pass parameters use a DRAM section.

But the 68k has a lesser prio than the Blitter, so there should not be any disturbance.

Good to know -- this is just a test case, so I have been writing parameters to GPU RAM.

 

There was no issue with parameters being corrupted before the blitter code was added, so there is definitely some kind of issue with the 68K and blitter both writing at the same time. The bus might be getting interrupted halfway through the long write or something, with one half getting corrupted.


18 hours ago, CyranoJ said:

 

No, it's a byte copy loop.  Chris, this is so far above your paygrade, please don't shit up yet another thread with your crap.  Go back to reading kindergarten books or something. This is the programming section, after 20+ years you have zero to contribute in here, and misinformation is not helpful.

Well, he is the same guy (under the handle "Achris31") that has the nerve to accuse members of this forum and AA in general of being "con artists" and "destroying homebrew game development" in almost every comment section under a Jaguar-related video on YouTube. It's a miracle he is still tolerated here or at any place with common sense!

 


Not speaking from experience, but what I've read is that while you can blit to GPU RAM from somewhere else, you can't blit from GPU RAM -> GPU RAM, so beware of that scenario. Hence, the LUT-as-span-buffer trick for texturing.


This is the result so far. There doesn’t seem to be much scope to speed up the rendering of the actual spans with the blitter, as if I remove the actual rendering I only get an additional few FPS. The majority of the time is spent just iterating and processing the height map ready for actual rendering.

[Attached image: FF2F2362-B6CE-4430-B94D-C9DD86A30D14.jpeg]


1 minute ago, SainT said:

This is the result so far. There doesn’t seem to be much scope to speed up the rendering of the actual spans with the blitter, as if I remove the actual rendering I only get an additional few FPS. The majority of the time is spent just iterating and processing the height map ready for actual rendering.

[Attached image: FF2F2362-B6CE-4430-B94D-C9DD86A30D14.jpeg]

Looks cool, Comanche voxel engine?


Just now, agradeneu said:

Looks cool, Comanche voxel engine?

Yes, exactly that. Just a pretty normal voxelspace engine. The only slightly cunning bit is using the blitter to do the map traversal to allow a tighter inner loop.


Just now, SainT said:

Yes, exactly that. Just a pretty normal voxelspace engine. The only slightly cunning bit is using the blitter to do the map traversal to allow a tighter inner loop.

Yeah, but this looks more realistic than previous ones, due to a much more detailed texture.


2 minutes ago, agradeneu said:

Yeah, but this looks more realistic than previous ones, due to a much more detailed texture.

True, I don’t think I’ve seen anything this detailed on the Jag in terms of voxelspace. It’s a 512*512 height and colour map with 8 bit height and 16 bit colour. It’s doing 100 depth samples and rendering to a 160*200 buffer. Increasing vertical resolution shouldn’t be that much slower either due to the column nature of voxels. There may be some kind of hierarchical approach that could be used to discard larger blocks of map data to speed things up as well. It’s a nice tight little domain for optimisation!
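For readers unfamiliar with voxelspace rendering, this is a minimal Python sketch of drawing one screen column front to back. It is a generic model of the technique, not the poster's engine: the projection formula, the parameter names and the tiny 4x4 demo map are all assumptions of this example (the thread's real figures are a 512*512 map, 100 depth samples and a 160*200 buffer).

```python
import math


def render_column(height_map, colour_map, map_size, cam_x, cam_y, cam_h,
                  dir_x, dir_y, depth_samples, screen_h, horizon, scale):
    """Return one screen column as a list of colours (None means sky)."""
    mask = map_size - 1              # map_size must be a power of two
    column = [None] * screen_h
    highest = screen_h               # topmost row drawn so far
    for step in range(1, depth_samples + 1):
        mx = math.floor(cam_x + dir_x * step) & mask
        my = math.floor(cam_y + dir_y * step) & mask
        idx = my * map_size + mx
        # Perspective projection: taller or nearer terrain lands higher on screen.
        top = math.floor(horizon + (cam_h - height_map[idx]) * scale / step)
        top = max(top, 0)
        # Only the part above what is already drawn is visible.
        for y in range(top, highest):
            column[y] = colour_map[idx]
        highest = min(highest, top)
        if highest == 0:             # column is full, nothing further can show
            break
    return column


SIZE = 4
flat_heights = [10] * (SIZE * SIZE)
colours = list(range(SIZE * SIZE))   # colour = map index, to make samples visible
col = render_column(flat_heights, colours, SIZE, 0, 0, 12, 1, 0, 4, 16, 4, 4)
print(col)
```

Because each column only ever grows upwards, extra vertical resolution adds pixels to the fill loop but no extra map samples, which fits the remark about vertical resolution being cheap.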


16 minutes ago, SainT said:

True, I don’t think I’ve seen anything this detailed on the Jag in terms of voxelspace. It’s a 512*512 height and colour map with 8 bit height and 16 bit colour. It’s doing 100 depth samples and rendering to a 160*200 buffer. Increasing vertical resolution shouldn’t be that much slower either due to the column nature of voxels. There may be some kind of hierarchical approach that could be used to discard larger blocks of map data to speed things up as well. It’s a nice tight little domain for optimisation!

That amount of detail and color is truly something special, great stuff!


37 minutes ago, SainT said:

True, I don’t think I’ve seen anything this detailed on the Jag in terms of voxelspace. It’s a 512*512 height and colour map with 8 bit height and 16 bit colour. It’s doing 100 depth samples and rendering to a 160*200 buffer. Increasing vertical resolution shouldn’t be that much slower either due to the column nature of voxels. There may be some kind of hierarchical approach that could be used to discard larger blocks of map data to speed things up as well. It’s a nice tight little domain for optimisation!

In my version you can change the Z depth and directly see the impact of it. Your colors look nicer!

Edited by 42bs

29 minutes ago, 42bs said:

In my version you can change the Z depth and directly see the impact of it. Your colors look nicer!

Yes, I have all of those types of controls on the controller. You can adjust depth samples, depth increment, horizon, etc all from the number pad. I’ve implemented a non-linear z increment as well such that the sample distance increases each successive depth step to give a greater view distance at the expense of detail. It works quite well! I also have sample quantisation and skipping in the near samples such that multiple columns are generated in a single sample. That was a useful optimisation, too.
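A non-linear z increment of this kind can be modelled in a few lines of Python. This is only a sketch of the general idea: the additive growth rule, the 1/256 map-cell units and the name `depth_distances` are assumptions, not details taken from the engine.

```python
def depth_distances(samples, first_step, growth):
    """Return sample distances where each step is `growth` larger than the one before.

    All values are integers in 1/256ths of a map cell (an assumption of this sketch).
    """
    distances = []
    z = 0
    step = first_step
    for _ in range(samples):
        z += step
        distances.append(z)
        step += growth   # later samples are spaced further apart
    return distances


linear = depth_distances(5, 256, 0)
stretched = depth_distances(5, 256, 64)
print(linear)
print(stretched)
```

With the same number of samples, the stretched version reaches further into the map, trading far-field detail for view distance.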


25 minutes ago, 42bs said:

Do you plan a game or just a demo?

Depends on how I can allocate my time. It would be nice to do some kind of racing game with a voxel landscape engine, but getting the time to actually finish a full on game would be a hard push. Getting at least some kind of playable demo would be a nice goal.


Is there a way to load the blitter registers a word (16 bits) at a time, or is it long only? I tried testing word writes, but it seems to fail, or perhaps write a full long, I’m not sure. Having the integer and fractional parts split across different registers is quite annoying when working with fixed point numbers. More time than I’d like is spent shuffling registers into the right format.
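The register shuffling in question looks roughly like this in a Python model (not Jaguar code). It assumes X and Y are held as 16.16 fixed point and that both the pixel pointer and the fractional pixel pointer pack Y in the high word and X in the low word; check that layout against the hardware documentation before relying on it. The name `split_for_blitter` is invented for the example.

```python
def split_for_blitter(x_fixed, y_fixed):
    """Split two 16.16 values into (integer_reg, fraction_reg).

    Assumed layout for both results: Y in bits 16-31, X in bits 0-15.
    """
    x_fixed &= 0xFFFFFFFF
    y_fixed &= 0xFFFFFFFF
    int_reg = (y_fixed & 0xFFFF0000) | (x_fixed >> 16)
    frac_reg = ((y_fixed & 0xFFFF) << 16) | (x_fixed & 0xFFFF)
    return int_reg, frac_reg


# X = 18.5, Y = 3.25 in 16.16 fixed point
int_reg, frac_reg = split_for_blitter(0x00128000, 0x00034000)
print(hex(int_reg), hex(frac_reg))
```

Each coordinate costs a shift, a mask and an OR per register, which is where the extra time goes.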

