
GPU in main, SCIENCE!


LinkoVitch


Hi,

I made a post in the Klax thread (which I may post a link to this post in, full on threadception! :D) relating to GPU in main. Given the number of times that thread has wobbled on its rails I'd rather not give it another nudge, so I thought it might be worth posting my evening's activities in my secret Jaguar lab here in its own thread, safely locked away in the programming section :D

 

So, basically I have gone and done actual sciencey things with the GPU and my Jag, and written up my experiment and findings on the U-235 website

 

http://www.u-235.co.uk/gpu-in-main-science/

 

Hopefully this will help put some perspective on the drawbacks of GPU in main, and well.. SCIENCE! :D (I was never very good at doing science write-ups at school, always had more fun with the chemicals and Bunsen burners than pens and paper.. so apologies if it's rambly or loses its way a bit :D )

 

For those who CBA with all the reading, just scroll down; there are numbers and a simple comparison at the end of the waffle.

 

 

10 TIMES SLOWER!!!!!!1111oneoneoneoneoneoneeleventy

 


I take it from CyranoJ's post that the 68k could be rendering sweet fanny adams and still cause bus contention?

 

Yup, if it's not stopped it will happily work its way through its program, nibbling at the bus as it goes. In my test it was still running, but then nothing else was, so it's not really taking that much extra from the GPU. In a real game you will at least have the OP running, which is going to get a lot of the bus a lot of the time.. unless your game has no visuals :D
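
For the curious, the usual way to take it right off the bus is simply to park it on a STOP instruction, so it fetches nothing at all until an interrupt wakes it. A rough sketch only, assuming supervisor mode (the SR value is just an example):

park:   stop    #$2000          ; halt the 68k - no instruction fetches until an interrupt arrives
        bra.s   park            ; after the interrupt is serviced, go straight back to sleep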


Don't want to mess this thread up too much, here's a quote from Typo in an interview with JagChris:

I think the "evil" of the 68K has been way overstated. It has the lowest priority on the bus so it won't hinder the GPU & Blitter that much.
While I was developing Tube, I made a few tests regarding this.
To render the tunnel effect, the GPU has to access memory rather intensively.
I measured the performance with the 68k polling the GPU memory to know when the GPU is done.
So we have both the GPU and the 68k accessing the bus intensively.
Then I tried with the 68K turned off. I got a +15.5% performance gain. Not bad, but it's not +300% like you could believe.
And this is close to a worst case scenario. In actual gameplay code, the 68K will execute more complex instructions (long jumps, complex addressing modes) and will be off the bus more often, leaving more cycles for the GPU.
In the case of Tube, the gameplay code is very light. When I remove it, the gain is less than 1%.
The problem is when you give too much work to the 68K. For example, in an advanced 3D game, if you compute all the physics calculations with the 68K you may end up in a situation where the GPU is waiting for the 68K to complete its work.

In the case of "Fallen Angels", I made a benchmark to see if it was worth the trouble to do some of the gameplay code using GPU in main.
To do this benchmark I wrote a box collision routine. It isn't very "GPU in main" friendly because it involves frequent jumps. But it is pretty typical of the kind of code I was going to need.
In the benchmark there are 200 collision tests done.
The results are:
voxel rendering alone, 68K turned off: 107 ticks
voxel rendering with the 68K turned off, then the 68K does 200 collision tests: 134 ticks
voxel rendering, then 200 collision tests by the GPU in main, 68K turned off the whole time: 124 ticks
voxel rendering with the 68K doing 200 collision tests in parallel, then 68K off: 108 ticks

I made more recent tests by removing most of the gameplay code... the gain is microscopic (less than 1%). So that's why I think moving the whole gameplay code to the GPU would result in worse performance.

Maybe the DSP could help a bit but it's already used by the sound engine. Anyway I think the gain would be very small.


linky
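
For context, the 68k "poll until the GPU is done" loop Typo describes boils down to something like this (a sketch only; GPU_DONE is a hypothetical flag the GPU routine writes when it finishes):

.wait:  tst.l   GPU_DONE        ; read the flag - every pass of this loop is another bus access
        beq.s   .wait           ; not done yet, keep hammering away

which is why even a 68k that is only waiting still costs the GPU some bus time, as per the +15.5% figure above.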


My analysis, while working on Project One, many moons ago:

 

TOM: Tom is doing the main gruntwork here. It only has a tiny 4k of scratch RAM, but over 95% of the main game code is in there frantically crunching away to give us the required 50/60fps. It is responsible for:

  • Animating everything on screen (basically, if it's an object then TOM handles its update)
  • Driving the bullet engine (bullet height sorting for the OP, movement and bullet injection)
  • Processing smartbombs
  • Collision detection (all object-to-object and bullet-to-object collisions processed EACH FRAME)
  • Ship updates and Parallax tracking control
  • 3D starfield update on the parallax layer
  • Map Tile updates (all map objects, every frame)
  • All this comes in at exactly 3.94k! PHEW!

 

tl;dr version - you can fit a LOT in 4k if you try.

 

[edit: sh3-rg didn't say that, something arsed up with the quote button]

Edited by CyranoJ

That's awesome, you guys working through all this, but none of the main proponents of the GPU in main workaround ever said it would be a big boon in 2D. Their contention was that if you were going to start throwing around polygons, the benefits would be felt. If it was of no use I don't think AtariOwl would have been wasting his time with it in his 3D stuff. Why do that? It's extra work.


 

3D would be worse! 2D stuff is really quite simple in comparison: you have a bunch of objects, you update their X/Y/frame etc. and you're done! For 3D there is a lot more to do! So running instructions out of main RAM would slow down your whole computation section; this is a simple FACT.
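
To put some flesh on that: a 2D per-object update really is only a handful of adds per object, something along these lines (a sketch only, written 68k-style for illustration; OBJECT_LIST, the OBJ_* offsets and NUM_OBJECTS are made up):

        lea     OBJECT_LIST,a0      ; start of the object list (made-up label)
        move.w  #NUM_OBJECTS-1,d7   ; object count (made-up equate)
.objloop:
        move.w  OBJ_DX(a0),d0
        add.w   d0,OBJ_X(a0)        ; x += dx
        move.w  OBJ_DY(a0),d0
        add.w   d0,OBJ_Y(a0)        ; y += dy
        addq.w  #1,OBJ_FRAME(a0)    ; bump the animation frame
        lea     OBJ_SIZE(a0),a0     ; step to the next object
        dbra    d7,.objloop

For 3D every object instead needs transforms, projection, clipping and sorting per vertex/polygon, which is exactly the kind of tight loop that hurts most when every instruction fetch comes from main RAM.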

 

Yes, AtariOwl used it, but what for? Was the core of his 3D engine running out of main? Or perhaps he had filled local RAM with that and simply shifted some of the lower-frequency routines into main? The only person who can answer what he did and why is the man himself. As I said, there will be edge cases where it's valid. It will be useful for larger chunks of code that won't fit into local (though these may be able to be broken down, which brings other obstacles, such as paging the code and vars in and out of local RAM). It may simply be that those who have had a good reason to use GPU in main have done so purely for their own sanity; trying to juggle multiple routines and registers around in local RAM may have been more of a headache than dropping your CPU's effective speed by 90%.

 

3D code doesn't run in any way differently to any other code; it benefits from fast RAM, probably more than 2D would...

 

When writing code you need to make decisions on how to do things. I am just trying to say that defaulting to GPU in main for every situation is wrong, and I mean that for 3D, 2D or anything else. It's there, it's possible, but it is by no means a magic bullet, and IMHO if anything it should be avoided where possible.


Wow that's amazing. Poor AO did all that extra work for nothing.

 

 



What's most amazing about the folks who have been carrying around the 'GPU in main as a panacea/3DPOWA' flag all these years is that they are usually the same people who are the TURN OFF THE 68K TO SPEED UP THE JAGUAR!!! crowd.

So... turn off the 68K because of possible bus contention - but run GPU in main so it can both run massively slower and log jam the bus at the same time...

Hi Remo. You seem to really dislike the whole idea of the GPU in Main thing. You certainly are very vocal about it. However, 'panacea' is not really what anyone is talking about, and yet you keep throwing that out there. What has been said is that for 3D it will HELP the Jaguar do better at that. Not make it a PSX or an N64, but help it to do better than what we've seen.

 

 


 

 

Your last sentence, even from my limited technical knowledge, just seems to be totally disregarding every proponent's argument for it. You never really address anything, you just keep saying the same statement over and over. Can you break down anything that has been said previously by AtariOwl or Gorf and disprove it? Not for what Typo was doing, but for actual 3D stuff?

Edited by JagChris

What has been said is that for 3D it will HELP the Jaguar do better at that.

 

NO. It will NOT. What it will do is make the GPU run 90% slower - how do you think basically crippling one of its most powerful processors is going to help it go faster, no matter what it would possibly be doing?

 

I'm going to be blunt here: I have a good fundamental understanding of the Jaguar's architecture and could program for it (though I have no desire to). If I were to use the GPU in main technique, there is one and only one reason I would do it - and that's because I probably could not/would not program the RISCs efficiently enough for the code to remain in local. That's it...


 

Please be more specific. How will it do this?

 

This is exactly what LinkoVitch is showing above. Main RAM is massively slower compared to local RAM; it's a simple fact. If the GPU is running instructions out of main RAM, it's sitting twiddling its thumbs for most of its cycle time, waiting for fetches. I honestly don't mean to sound condescending here, but if you don't understand that, I can't make it any simpler.
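
To put rough numbers on it, using the ~10x figure from LinkoVitch's test above: an inner loop of, say, 1,000 GPU instructions that takes roughly 1,000 cycles from local RAM lands somewhere around 10,000 cycles when every fetch has to come from main - and that's before the OP and blitter start taking the bus away from it. (Illustrative numbers only; the 10x ratio is the measured one, the loop size is made up.)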


 


 

I understand that completely. Absolutely. The recommended use was time-critical stuff in the fast local RAM and the AI and other not-so-time-critical stuff out in main. Yes, a C compiler would be a good use of all this. The contention has been that swapping EVERYTHING, as far as GPU code goes, in and out of local is time consuming - far more time consuming than running what can be run out in main RAM. You will lose whatever speed advantage local RAM has and THEN SOME.

 

If you have a little 4k or 8k program, or maybe even a bit bigger, swapping code is the way to go. But if you have a BIG program, like Doom, like anything that takes all the memory, then all that swapping in and out of local takes its toll.

 

AtariOwl's blog addressed speed comparisons of the GPU vs the 68k out in main as far as all that goes as well, and no one has disproven that yet.

Edited by JagChris

So....

 

writing your code in one lump and running it in main 10x slower than running from local is somehow magically faster than coding your modules in 2k chunks and swapping the next one in while the current one is executing?

 

BTW, Doom copies 4k chunks into GPU RAM. If it ran from main you'd be looking at dropping the frame rate by at least 1/2 to 2/3 of the current one. How is that a massive performance boost?

 

Copying 4k of code in a block (using the blitter) is trivial in comparison to a permanent 10x slower execution speed. You really, really need to DO THE MATH.
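
Roughly: a 4k module is 512 phrases, so a blitter copy is a one-off cost on the order of a couple of thousand bus cycles. If that module then executes, say, 50,000 instructions before it's swapped out, that's ~50,000 cycles from local versus something like 500,000 at the ~10x main-RAM penalty measured earlier in the thread - the copy pays for itself many times over. (The instruction count is purely illustrative; the 10x figure is the one from the tests above.)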

Edited by CyranoJ

Look CJ, your words are completely going against those who have done it. And the 'writing the code in one lump and running it in main' comment is not what was said. You don't want to be 'arsed'? Then that's fine. Those who have done it have their stuff up, like Surrounded. If it's so easy then duplicate its performance with the Atari Renderer that everyone has access to. I'm sorry, but before, you said 'this shit is easy.' Someday, to put the debate to rest, you'll have to be arsed. Until then, with all due respect, it's just words.

 


 

As for the other thread where I used the word 'deceptive', well, I am sorry. I just don't know how else to put it. Misleading? Terribly mistaken, maybe? Maybe I should have used that.

 

You know what CJ, for all the crap we give each other I have to hand it to you. You never quit like the others have. You're still here.

Edited by JagChris

Wait up, you want me to duplicate the performance using the Atari Renderer. Which will somehow prove that 3D code is faster running on GPU in main. How does that work, exactly?

 

BTW, here is 'pick a random number between one and 6':

 

        move.l  U235SE_rng,d0       ; grab the current random long from the U-235 SE RNG
        and.l   #$00030003,d0       ; keep two bits of each word, so each half is 0-3
        move.w  d0,d1               ; d1 = low word (0-3)
        swap    d0                  ; bring the high word (0-3) down
        add.w   d1,d0               ; sum of the two halves: 0-6
        bne.s   .notzero
        addq    #1,d0               ; a zero becomes 1, so the result is 1-6
.notzero:

 

There you go, 1-6 randomly generated. Don't really think running that on the 68000, the GPU (in local, or in main), or indeed on a hamster in a wheel, will make the setup for the renderer any quicker, do you?

 

Your words are completely going against all the evidence listed, by everyone, everywhere.

 

Believe what you want in one hand, and shit in the other. I guarantee you one will fill up faster than the other.

Edited by CyranoJ

This may have been mentioned somewhere along the line but I'm too confused to pick it up.

 

Is the middle ground to use main memory as a cache for code about to be loaded into local RAM? Or is the speed of ROM good enough to serve the same purpose?

 

I'm talking way above my head so forgive me if the above is gobbledegook. :P


LARGE programs CJ. LARGE programs. Benefit is in LARGE programs.

 


OOOH a real question!

 

Yes, the middle ground is to page in the next segment while the current segment executes. ROM will be slower in theory - however, as long as it's there when it's needed, it won't really matter.
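
For anyone wondering what the mechanics of that look like, here's the shape of it in 68k terms (a sketch only - a real engine would use the blitter for the copy and its own handshake; NEXT_MODULE, MODULE_READY and the bank-B address are made-up placeholders):

        lea     NEXT_MODULE,a0      ; next 2k code module, waiting in main RAM or ROM
        lea     $F03800,a1          ; the currently idle half of GPU local RAM ('bank B' here)
        move.w  #(2048/4)-1,d0      ; 512 longwords = 2k of code
.copy:  move.l  (a0)+,(a1)+         ; fill bank B while the GPU carries on executing bank A
        dbra    d0,.copy
        move.l  #1,MODULE_READY     ; flag (in main RAM here) telling the GPU bank B is ready

The GPU side just finishes its current module, waits on MODULE_READY, jumps into the other bank, and the loader starts refilling the half it just left.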

 

However, there is no C compiler for the GPU, there specifically isn't one that spits out 2k or 4k code chunks, and it's highly unlikely there ever will be. And even if there were, you can be sure it wouldn't spit out code as optimized as hand-written assembler. So there'd just be another argument when this theoretical compiler was written and still wasn't fast enough to fulfill the wishes of the faithful. So it's all a bit moot, really.

 


Then your Surrounded example makes even less sense, because all that does is pick a random number between 1 and 6 and then bring an object in from a corresponding edge. You could code that on a 2k VCS cart.

 

One hand getting heavy yet?

 



I suppose the dream compiler would be something like uc65

http://forums.nesdev.com/viewtopic.php?f=2&t=10242

 

Except with the added feature of intelligently handling code/data streaming to the GPU.

 

Where is my barrel of infinite code monkeys?!?

