
GPU in main, SCIENCE!


LinkoVitch


1st up, refactoring existing code that probably isn't the most efficient in the 1st place isn't going to get you the best results. Your example doesn't really explain much of anything. You seem obsessed with 3D being some kind of magic operation; it's not.

 

That doesn't really explain anything in any clear way. The benefits of any one change are not detailed, so it is impossible to know which thing (from your brief summary) provided what improvement.

 

 


What Scott (JagMod) did to speed up the renderer is get rid of the 68k code altogether,

 

Yup, this will likely speed up a 3D renderer, no argument there, but it doesn't really state where that code is running. My 1st module player was 99% 68K code; it struggled with 4 channels at 12kHz. The current code is 100% RISC (save the init code, which isn't part of playback) and will happily do 8+ channels at 32kHz (if you have the bus time free). So yeah, removing 68K makes stuff go faster.

 

 


eliminate all the blits of the code to GPU local and put them into one big chunk.

 

Sooo... is this one big chunk of code in GPU local? Running it in main? What? It doesn't say; it is not clear what has been done here.

 

 

 

The other issue with the Atari renderer is it has the GPU stop completely

while the blitter keeps loading in vertices to the RAM. What should be happening is that the verts should be out in main
and keep the GPU running at all times. OR use a small portion of the local to have the blitter move the verts to WHILE
the GPU is constantly processing them, or at least doing something else while waiting for the blitter to load the next set.

 

Which RAM? This is also not clear; I'd assume it is referring to local RAM. There are a few issues I'd see:

 

1) Again this is referring to a problem with the Atari code, so what's your point?

2) The RAM in the GPU (and main) is NOT dual ported, so only one device will be able to access it at any one time (I'd assume based on RAM pages, but that depends on the size of these, and given the corner cutting that has gone on in the Jag, I'd assume that the local RAM can be accessed by one device at a time. I am sure someone like SCPCD would know for certain; failing that, some experimentation could possibly prove it). So, as the RAM is only single ported, even if a small section of local RAM was reserved for 2 banks of vertices (one to work on, one for the next set to be loaded into), which I agree is a sensible design (see the sketch below), it would be moot, given that while the blitter was writing to the RAM the GPU core wouldn't be able to access it at the same time.

 

As there is also only one main bus, it would likewise not be possible for the GPU to access main RAM whilst the blitter was reading from it (or writing to it), for the same reason. We're back to that single bus again: if the GPU is fetching its code across the main bus it is hogging the bus much like the 68K, though not as much of a hog.
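To make that two-bank idea concrete, here's a little host-side C model of it. This is NOT Jaguar code, just a sketch: blitter_fill(), the bank size and the dummy transform are all made up for illustration, and on real hardware both banks would sit in the single-ported local RAM, which is exactly where the contention above bites.

/* Host-side C model of the double-banked vertex scheme from point 2.
   blitter_fill() stands in for a real blit from main RAM; on hardware
   the two banks would live in GPU local RAM and the processing loop
   would be RISC code. Illustrative only. */
#include <stdio.h>
#include <string.h>

#define BANK_VERTS 32            /* vertices per local-RAM bank (assumed) */
#define TOTAL_VERTS 2000         /* vertices sitting out in main RAM      */

typedef struct { short x, y, z; } Vert;

static Vert main_ram[TOTAL_VERTS];   /* "main RAM" vertex pool   */
static Vert bank[2][BANK_VERTS];     /* two banks in "local RAM" */

/* Stand-in for the blitter: copy the next chunk into one bank. */
static int blitter_fill(int b, int from)
{
    int n = TOTAL_VERTS - from;
    if (n > BANK_VERTS) n = BANK_VERTS;
    if (n > 0) memcpy(bank[b], &main_ram[from], (size_t)n * sizeof(Vert));
    return n > 0 ? n : 0;
}

int main(void)
{
    int from = 0, cur = 0;
    int n = blitter_fill(cur, from);      /* prime bank 0 */
    while (n > 0) {
        from += n;
        /* On hardware the blitter would fill the OTHER bank here,
           overlapping with processing: subject to the single-port
           local RAM contention discussed above. */
        int next = blitter_fill(cur ^ 1, from);
        for (int i = 0; i < n; i++)       /* "GPU" processes current bank */
            bank[cur][i].x += 1;          /* dummy transform */
        cur ^= 1;
        n = next;
    }
    printf("processed %d verts\n", from);
    return 0;
}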

 

 

 

So if you have a large program, you have to put everything into GPU local and keep swapping stuff out? How much time does that eat up? Is that not the heart of the matter? Those who are proponents of it claim that in some instances the swapping eats up that speed difference real quick.

 

I haven't timed this, and have no access to a Jag to test and prove it, but if we assume a 32-bit blit (GPU RAM is only 32 bit), 4K would be 1024 32-bit reads. I am not sure of the benefit the blitter gains from sequential read access; I'd assume there is some to be had here, and that it would be at least twice as fast as the GPU reading those same instructions directly from RAM. My logic behind this is that in my test of 1200 instructions, the instructions were all 16 bit in size, hence reading a 32-bit word shifts twice the amount of data across the bus as a 16-bit read would. Of course my test was based on the execution of those instructions, not simply the reading of memory, so there would have been additional delays in the timing that I'd assume a plain copy would not incur.

 

Of course once those instructions were in GPU RAM they would then run at 10x the speed they would in main. So, assuming a really poor blitter copy taking 20000 ticks and then 4000 ticks to execute, that gives a total of 24000 LinkoTicks ( :) ) for a copy-and-run, vs approx 40000 LinkoTicks to run from main directly. This is all based on ropey maths and I am fairly confident the numbers will be off, but similar. As I said, not near a Jag to test right now, but I will when I get home and get a mo, and I'll post my results too, with method.
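For anyone who wants to sanity-check the ropey maths, here is the same sum as a trivial C snippet. All three input numbers are the guesses from the paragraph above, not measurements:

/* The "LinkoTicks" break-even sum. copy_ticks is the deliberately
   pessimistic blit guess, local_exec_ticks the guessed run time from
   local RAM, and speed_ratio the 10x local-vs-main figure. */
#include <stdio.h>

int main(void)
{
    const long copy_ticks       = 20000;  /* pessimistic 4K blit          */
    const long local_exec_ticks = 4000;   /* run time once in local RAM   */
    const long speed_ratio      = 10;     /* local vs main speed, guessed */

    long copy_and_run  = copy_ticks + local_exec_ticks;    /* 24000 */
    long run_from_main = local_exec_ticks * speed_ratio;   /* 40000 */

    printf("copy + run from local: %ld LinkoTicks\n", copy_and_run);
    printf("run from main:         %ld LinkoTicks\n", run_from_main);

    /* In general, copying wins whenever
       copy_ticks < local_exec_ticks * (speed_ratio - 1). */
    printf("copying wins while the blit costs < %ld ticks\n",
           local_exec_ticks * (speed_ratio - 1));
    return 0;
}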

 

Don't forget you typically don't blit the code out, you would just overwrite it. Variables stored in GPU RAM could possibly need to be copied out, but this depends entirely on the code.

 

You can get a lot of code into 4K of RAM; the entirety of my sound engine (if you ditch its buffers) is less than 4K, and that's a Sound Engine, Module Parser, Random Number Generator and Pad reading code, with 3 active interrupts and a constantly running loop.

 

The best solution would be to write a new engine from scratch, with a strong design intended to run in an optimal way on the Jaguar, rather than refactoring code that wasn't originally intended that way.


3D does seem to be the magic formula for this. As for the swaps causing slowdown, here is more evidence this may be the case:

 

 


One of the things that degraded performance in WTR wasn't the speed of the polygon renderer, it was the GPU overlay manager-- lots of chunks of code were being swapped in and out of the small 4K of memory within that chip during the game and that was something I really wanted to tidy up.

 

Quote from Lee Briggs. This was before the GPU bug workaround was found, but he seems to be implying that too much swapping is a performance killer.


It seems you broadly feel that AtariOwl should do something similar before he decides to code anything again. Didn't you say you feel that his engine would have run faster if he'd not wasted time using the GPU in Main techniques he did? Broadly so?

 

 

 

I kindly suggest you either start to develop an in-depth understanding of the Jaguar's hardware, or let the folks who know what they are doing and why they do it continue the conversation if they wish.

 

You aren't doing yourself any favors by repeatedly bringing this up without having any of the required knowledge, and then proceeding to stumble around and resorting to saying 'ask them why'.


No, nobody said that. At all.

 

We said if he'd got it all running in local it would be faster. We said that the bits running in main would be faster than the 68000, or, because they were smaller code segments, faster than code-module-swapping. We've said that edge-case code can, indeed, run faster in main. And AtariOwl agreed with it.

 

We're back to that comprehension thing again.

 

You can stop banging on about this now.

 

Didn't you say you feel that his engine would have run faster if he'd not wasted time using the GPU in Main techniques he did? Broadly so?

 


3D does seem to be the magic formula for this. As for the swaps causing slowdown, here is more evidence this may be the case:

 

 

 

One of the things that degraded performance in WTR wasn't the speed of the polygon renderer, it was the GPU overlay manager-- lots of chunks of code were being swapped in and out of the small 4K of memory within that chip during the game and that was something I really wanted to tidy up.

 

Quote from Lee Briggs. This was before the GPU bug workaround was found, but he seems to be implying that too much swapping is a performance killer.

 

No. He's saying that their "GPU overlay manager" could have been better; the renderer was not the issue. You have no idea how that worked, so you cannot claim to know how to make it any better. FFS, just please cease and desist.

 

 

It seems you broadly feel that AtariOwl should do something similar before he decides to code anything again. Didn't you say you feel that his engine would have run faster if he'd not wasted time using the GPU in Main techniques he did? Broadly so?

 

Broadly speaking Owl can do whatever pleases him. He stated he's probably done with the Jaguar and was looking at Dreamcast.

 

Broadly speaking, you should probably STFU when it comes to anything more than "hello world", for your own good and everyone else's sanity.


Although it's obviously annoying to retread the same territory over and over again, I do appreciate it. It helps the architecture sink in. Also, documentation can only go so far - you guys have dealt with the gotchas.

 

I'll take this discussion over code review Wednesdays any day!

 

I am happy to reiterate stuff for that purpose; sometimes one explanation of something doesn't quite sink in with some people (myself included) and rephrasing it can be a huge help. I have no problem with this at all, happy to help in fact if I can. I find for the most part discussion of the Jag's quirks to be quite interesting. What I do find annoying is someone reiterating the same thing repeatedly, without seemingly being able to understand (or try to understand) what they are saying, while constantly trying to take me down about something.

 

JagChris, I am not trying to fling mud on anyone's work, so please stop trying to make out that I am. Join in the conversation, but PLEASE do so with your own ACTUAL understanding rather than repeating summarisations from others as gospel facts; they don't hold any real detail, and you also cannot follow up with answers to any questions posed against them. Why not take this time to look at what's been done, fire up your own JagDev setup and have a play, and see what you can find out. Don't take anyone's word for anything: attempt something, record your findings and present them along with your method. This way people can learn what has been done and how it's been done, which is much more useful than copying a high-level description someone else wrote about code you have never seen.


I don't have the Jag specs nearby, but what is the speed difference between running instructions from the 4K cache vs RAM?

 

Even at a 10x speed-up, there is significant overhead in moving code & data in first and then repeating that for multiple code chunks.

 

It's easy to come up with a scenario for ONE piece of code where the difference will be dramatic. But if you want to do it for all code?

 

If you have 25 chunks of code you swap in & out, you pay the price of 25 memory transfers. And then you have to transfer the calculated data back (from fast cache into slow RAM).

So now the execution time in fast cache must offset two things - the transfer of code & data into the cache, and then the transfer of just the data back. And of course, the total time must be shorter than if none of that ever happened. That is quite a lot of work...

 

Besides, how is this comparison even valid? We can't compare execution time of 68k instructions with RISC instructions! We would have to rewrite the algorithm first from 68k ASM into RISC! That's a lot of effort, right there...

 

I can imagine this making a lot of sense with 3D transformations, but since the GPU's RISC instruction set does the calculations in a few ticks anyway (compared to the 68k), that's not really a comparison.

So, why do we compare execution time of a RISC instruction set with a CISC instruction set? What am I missing?


Err no.

 

You wouldn't transfer the calculations in and out. You'd run the code from local, read the data ONCE from main, and write the result ONCE back to main.

 

There is no benefit to copying the data from main to local, reading it again from local, processing it, writing it back to local, and then writing it back to main.

 

Let's say you transform 2000 points. Do you really think a 4K blit is significant compared to executing a loop 2000 times? Whatever that loop is doing, if it's doing it 2000 times from main, ten times slower, the blit+local execute is going to win every time.

 

Now if you are blitting in routines you run once, that don't loop much, then yes, the main execution will be faster. Then again, you can fit a hell of a lot in 4k. A full render engine should fit with room to spare for at least a few of those pesky edge case routines.
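As a concrete illustration of that read-once/write-once pattern, here's a minimal sketch in C. It's a model only: the real thing would be RISC code resident in local RAM, and the 1.15 fixed-point vertex format is an assumption for the example.

/* Transform NPOINTS vertices: each one is read ONCE from "main RAM"
   and the result written ONCE back, with no bounce through local RAM.
   The matrix values and fixed-point format are illustrative. */
#include <stdio.h>

#define NPOINTS 2000
#define FIX(f) ((short)((f) * 32768))   /* 1.15 fixed point */

typedef struct { short x, y, z; } Vert;

static Vert src[NPOINTS];    /* "main RAM" input  */
static Vert dst[NPOINTS];    /* "main RAM" output */
static short m[3][3];        /* 3x3 rotation, 1.15 fixed point */

int main(void)
{
    m[0][0] = m[1][1] = m[2][2] = FIX(0.999);   /* near-identity demo matrix */

    for (int i = 0; i < NPOINTS; i++) {
        Vert v = src[i];     /* one read from main */
        dst[i].x = (short)((m[0][0]*v.x + m[0][1]*v.y + m[0][2]*v.z) >> 15);
        dst[i].y = (short)((m[1][0]*v.x + m[1][1]*v.y + m[1][2]*v.z) >> 15);
        dst[i].z = (short)((m[2][0]*v.x + m[2][1]*v.y + m[2][2]*v.z) >> 15);
        /* one write back to main */
    }
    printf("transformed %d points\n", NPOINTS);
    return 0;
}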

 

 

 

If you have 25 chunks of code you swap in & out, you pay the price of 25 memory transfers. And then you have to transfer the calculated data back (from fast cache into slow RAM).

So now the execution time in fast cache must offset two things - the transfer of code & data into the cache, and then the transfer of just the data back. And of course, the total time must be shorter than if none of that ever happened. That is quite a lot of work...


 

No. He's saying that their "GPU overlay manager" could have been better; the renderer was not the issue. You have no idea how that worked, so you cannot claim to know how to make it any better. FFS, just please cease and desist.

 

 

 

Broadly speaking Owl can do whatever pleases him. He stated he's probably done with the Jaguar and was looking at Dreamcast.

 

Broadly speaking, you should probably STFU when it comes to anything more than "hello world", for your own good and everyone else's sanity.

 

I did not say the renderer was the issue. What I believe he is saying is that the overlay manager, which to my understanding is what swaps the code in and out of the 4K cache, needs to be as efficient as possible, or it degrades performance by swapping code too much. The less, the better. Now it's entirely possible I am misunderstanding that. Since you understand it so clearly, then FFS please enlighten me.


OK, fair points. Now help me with my comprehension. How would you outline how AO should do his code differently? Remo made it sound like AO did it all wrong.

 

 

 

 



You mean to say that he could have managed to get it working without the GPU in main stuff, and then it would have been even faster? Or could this be a special case?


GPU in local is hugely faster than GPU in main, so broadly yes.

 

No, nobody said that. At all.

 

We said if he'd got it all running in local it would be faster. We said that the bits running in main would be faster than the 68000, or, because they were smaller code segments, faster than code-module-swapping. We've said that edge-case code can, indeed, run faster in main. And AtariOwl agreed with it.

 

We're back to that comprehension thing again.

 

You can stop banging on about this now.

 


 

I did not say the renderer was the issue. What I believe he is saying is that the overlay manager, which to my understanding is what swaps the code in and out of the 4K cache, needs to be as efficient as possible, or it degrades performance by swapping code too much. The less, the better. Now it's entirely possible I am misunderstanding that. Since you understand it so clearly, then FFS please enlighten me.

 

Relevance to this thread? Enlightenment: if you wish to discuss what you think some guy said in some interview, make a thread, don't pollute this one further.

 

OK, fair points. Now help me with my comprehension. How would you outline how AO should do his code differently? Remo made it sound like AO did it all wrong.

 

Go PM Owl if you want to know what he would do. The only person advocating blindly following commandments is yourself.

 

 


 


Put simply, no.

 

Go do your own research, go write your own code, go come to your own conclusions (founded in science, with some facts) instead of repeating blindly what you heard a guy tell his mate down the pub.

 

Owl can write his code however he wants to. It's his code. If he's not happy with it he can chip away for that extra 1% speed. Or he can say 'fuck it, I'm done' and accept where he is. How anyone else thinks it should be done is immaterial unless he specifically asks for input.

 

To top all that off, why would anyone outline to you how they think Owl should write his code? Are you then going to take that to Owl with 'You should do it this way'? No, don't think so.

 

Please stop trying to stir up trouble with comments like 'Remo made it sound all wrong' - it's not only inaccurate, but makes you look like a troll.

 

OK, fair points. Now help me with my comprehension. How would you outline how AO should do his code differently? Remo made it sound like AO did it all wrong.

 

 

 

 

 


Can I request all the cruft be removed from this thread? The programming discussion isn't for coffee table chats about what JagChris thinks he thinks someone else thinks. It's really made a mess of linko's efforts... and as I believe he's planning more experiments, future jaguar coders aren't going to want to sift through all this drivel to get to a scrap of info that might be nestled between vast bouts of derpage.


I'm not going to take the time to remove the cruft. It's not my fault that the discussion got off course, so I shouldn't be the one punished for it. Create another thread if you want. However, further discussions going off the mark will be moderated.


Sorry for the headaches Sauron, it wasn't my intention to stir up any arguments, only to share info and findings that are hopefully of some use.

 

I will post my experiments and results on my site as per the 1st ones, and may fire up a new thread if it makes sense. People can always go to the U235 website to find the info and do the tests themselves. I have no intention of stating that programmer X is wrong (without empirical evidence), and I'm certainly not going to comment on someone's approach based on hearsay and conjecture; I'd only do that if I had their code in front of me and they had asked me to. There is no way I can comment on what I cannot see, and tbh I don't think anyone else can either. Anyway, I'm rambling.

 

Hope to get the next few tests fired off when I get home at the start of this next week, already mind coded up some tests :)

 

I plan to try:

 

  1. Timing the copying of code from main to local RAM with the blitter.
  2. Testing whether the blitter accessing local RAM while the GPU is running impacts the performance of the GPU.

Hopefully the results will be interesting and useful; I will post the code up too, so others can see my methods etc.
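For test 1 the rough plan (an assumption of method on my part, not tested code) is to count scanlines around the blit using TOM's vertical counter, with the actual blitter setup stubbed out. Something along these lines:

/* Rough, untested timing harness sketch. VC (0xF00006) is TOM's
   vertical line counter per the Jaguar docs; counting scanlines
   elapsed gives a coarse clock. do_blit_copy() is a stub standing
   in for the real blitter programming (A1/A2 setup, B_COUNT, then
   B_CMD and a poll for idle). */
#include <stdio.h>

#define VC (*(volatile unsigned short *)0xF00006)

/* Stub: replace with an actual 4K blit from main RAM to GPU
   local RAM (0xF03000). */
static void do_blit_copy(void)
{
}

int main(void)
{
    unsigned short start, end;

    start = VC;              /* scanline before the copy */
    do_blit_copy();
    end = VC;                /* scanline after the copy  */

    /* NB: VC wraps every frame, and a single 4K blit may complete
       within a line; repeating the blit N times and counting whole
       frames would give better resolution. Test 2 would bracket a
       GPU busy-loop the same way, with and without a concurrent
       blit into local RAM. */
    printf("blit took ~%d scanlines\n", (int)(end - start));
    return 0;
}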


It lives in its own little happy merry world doing its own thing (much to the annoyance of CJ at times :D, at least I hope he hasn't taken a shotgun to the Jag to stop it running :D :D )

 

U235 SE runs in the DSP's local RAM except when it needs to fetch sample data, in which case it fetches a 32-bit word (so 4 samples) at once, then uses them as required. It populates a couple of playback buffers internal to the DSP and plays those to the DACs; when one empties it switches to the other and refills the empty one. If there was more cache available to Jerry it would hardly touch the bus; alas there isn't much room for sample data, although there are some plans on the todo list that should hopefully reduce the bus activity.
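In C-flavoured pseudocode the fetch-and-ping-pong idea looks roughly like this. The real engine is DSP RISC code; the buffer size and the 8-bit sample format here are assumptions for the sketch:

/* One 32-bit bus read yields four 8-bit samples; two small buffers
   ping-pong so one can be refilled while the other plays. */
#include <stdint.h>
#include <stdio.h>

#define BUF_SAMPLES 64                   /* assumed buffer size */

static int8_t buf[2][BUF_SAMPLES];       /* "DSP local RAM" buffers    */
static int playing = 0;                  /* which buffer feeds the DAC */

/* Refill one buffer, pulling 32 bits (4 samples) per bus access. */
static void refill(int which, const uint32_t *src)
{
    for (int i = 0; i < BUF_SAMPLES / 4; i++) {
        uint32_t w = src[i];             /* ONE main-bus read */
        buf[which][i*4 + 0] = (int8_t)(w >> 24);
        buf[which][i*4 + 1] = (int8_t)(w >> 16);
        buf[which][i*4 + 2] = (int8_t)(w >> 8);
        buf[which][i*4 + 3] = (int8_t)(w);
    }
}

int main(void)
{
    static uint32_t sample_rom[BUF_SAMPLES / 4];   /* stand-in data */

    refill(playing, sample_rom);
    /* ...when buf[playing] runs dry, the playback interrupt flips
       to the other buffer and refills the one that just emptied: */
    refill(playing ^ 1, sample_rom);
    playing ^= 1;
    printf("now playing buffer %d\n", playing);
    return 0;
}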


Err no.

 

You wouldn't transfer the calculations in and out. You'd run the code from local, read the data ONCE from main, and write the result ONCE back to main.

 

There is no benefit to copying the data from main to local, reading it again from local, processing it, writing it back to local, and then writing it back to main.

 

Let's say you transform 2000 points. Do you really think a 4K blit is significant compared to executing a loop 2000 times? Whatever that loop is doing, if it's doing it 2000 times from main, ten times slower, the blit+local execute is going to win every time.

 

Now if you are blitting in routines you run once, that don't loop much, then yes, the main execution will be faster. Then again, you can fit a hell of a lot in 4k. A full render engine should fit with room to spare for at least a few of those pesky edge case routines.

 

 

Well, of course, if you pick a case where you loop through an array 2000 times, then it's easily going to offset the copying overhead.

 

I was talking about a simple case of code that does maybe a short loop (up to 10 iterations) but is, most of the time, just serial execution. I don't believe there would be much saving (if any) if there was just basic, serial execution without any looping.

 

Of course, one would have to make benchmarks, but from my experience it is pretty easy to calculate the cycles in a loop with a reference manual to get an idea of the potential difference.

 

 

 

Oh, and the example with transformation is kinda irrelevant on the Jaguar. If I were to implement a 3D transformation in ASM, it would be a horrible waste of time, effort and especially performance to do it in 68k ASM rather than RISC ASM, where math operations take merely a few cycles (on a double-speed core, compared to the 68k).
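A quick back-of-envelope on that multiply claim. The 70-cycle worst case for a 68000 MULS.W is from the Motorola manual and the clock speeds are the Jaguar's documented ones; the 3-cycle figure for the GPU multiply is an assumption for illustration, not a measured number:

/* Compare one 16x16 signed multiply on each side, in microseconds. */
#include <stdio.h>

int main(void)
{
    const double cpu_mhz  = 13.295;   /* 68000 clock */
    const double risc_mhz = 26.591;   /* GPU clock   */
    const double muls_cyc = 70.0;     /* 68000 MULS.W worst case   */
    const double mult_cyc = 3.0;      /* assumed GPU multiply cost */

    double t68k  = muls_cyc / cpu_mhz;
    double trisc = mult_cyc / risc_mhz;

    printf("68k MULS.W: %.3f us\n", t68k);
    printf("GPU mult:   %.3f us (~%.0fx faster)\n", trisc, t68k / trisc);
    return 0;
}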


Thanks CJ. Was just curious. If you guys can get even more out of the system than AtariOwl got then that's awesome! Almighty impressive, I must say.

 

Put simply, no.

 

Go do your own research, go write your own code, go come to your own conclusions (founded in science, with some facts) instead of repeating blindly what you heard a guy tell his mate down the pub.

 

Owl can write his code however he wants to. It's his code. If he's not happy with it he can chip away for that extra 1% speed. Or he can say 'fuck it, I'm done' and accept where he is. How anyone else thinks it should be done is immaterial unless he specifically asks for input.

 

To top all that off, why would anyone outline to you how they think Owl should write his code? Are you then going to take that to Owl with 'You should do it this way'? No, don't think so.

 

Please stop trying to stir up trouble with comments like 'Remo made it sound all wrong' - it's not only inaccurate, but makes you look like a troll.

 


Thanks CJ. Was just curious. If you guys can get even more out of the system than AtariOwl got then that's awesome! Almighty impressive, I must say.

 

 

Personally, I'd rather hear more about any progress you've made using C. I stopped at getting the 64-bit version of make to work. It's always encouraging to hear more from those who have bravely plowed ahead :)


Personally, I'd rather hear more about any progress you've made using C. I stopped at getting the 64-bit version of make to work. It's always encouraging to hear more from those who have bravely plowed ahead :)

 

Thanks CJ. Was just curious. If you guys can get even more out of the system than AtariOwl got then that's awesome! Almighty impressive, I must say.

 

Personally, I'd rather he disappeared all the way up Stevie's backside instead of popping out to impart his pearls of wisdom in this thread. His motives before were questionable. His outright trolling now clarifies his agenda all too clearly.

 

Such actions do nothing to promote or further Jaguar development and understanding; they only muddy already cloudy waters and reinforce external perceptions that Jaguar fans are all batshit insane.

 

Pathetic.

