42bs (Author) Posted May 25, 2022
18 minutes ago, SCPCD said: "many cycles are lost due to register write back conflicts"
I started reordering heavily to reduce stalls, but there was not much benefit, at least none that was visible. I tried to move the LOAD of the next X/Y after the divide, but this resulted in glitches. So there seems to be another bug in the GPU.
21 minutes ago, SCPCD said: "Not exactly: HIDATA is clobbered after any external LOAD, but is kept for any STORE (internal & external) and internal LOAD."
My experience is different, but I need to find out why. Maybe I did something weird.
23 minutes ago, SCPCD said: "You can check my ST2JAG optimised code"
I will!
selgus Posted May 25, 2022
What I've done with vector processors in the past (like on the PlayStation) is unroll the loop as deep as your pipeline is, working on the different stages of your calculations in different registers. You can eliminate the stalls and still keep your code readable, and if you have an instruction cache, use its size as the limit on how far to unroll. Divide pipelines are normally pretty deep, so you can hide those stalls by not using the results right away and doing other logic needed for your loop before finishing your perspective correction.
42bs (Author) Posted May 25, 2022
There can be only one divide in flight, and it takes 16 cycles. So those cycles must be filled somehow before the result is used (and the next divide can be issued). Maybe it is possible to do the next rotation/transformation in them, but I doubt that the resulting code would be readable.
selgus Posted May 25, 2022
That is what I mean: you can do the beginning stages of the next vertex while waiting for the previous one to complete, then grab the result. You can keep it readable if you disregard some of the advice against commenting and document what you are doing.
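To make the "start the next vertex while the divide runs" idea concrete, here is a toy cycle model (a sketch, not Jaguar-accurate). It assumes an in-order, single-issue core, 1-cycle ALU results, and a single non-pipelined divider taking 16 cycles as 42bs describes; it ignores the register write-back conflicts SCPCD mentioned. ROT_OPS, the `Op` record and the register names are hypothetical stand-ins for the real rotation code.

```python
from dataclasses import dataclass

DIV_LATENCY = 16  # from the thread: one (non-pipelined) divide, 16 cycles
ALU_LATENCY = 1   # assumption: simple ALU results are usable next cycle
ROT_OPS = 10      # hypothetical amount of rotation/translation work per vertex

@dataclass(frozen=True)
class Op:
    dest: str
    srcs: tuple = ()
    is_div: bool = False

def cycles(program):
    """In-order, single-issue model: an op issues once its sources are ready
    and, for a divide, once the single divider is free again."""
    ready = {}    # register -> cycle its value becomes available
    div_free = 0  # cycle at which the divider accepts the next divide
    t = 0         # earliest issue cycle for the next op
    for op in program:
        start = max([t] + [ready.get(s, 0) for s in op.srcs])
        if op.is_div:
            start = max(start, div_free)
            div_free = start + DIV_LATENCY
            ready[op.dest] = div_free
        else:
            ready[op.dest] = start + ALU_LATENCY
        t = start + 1
    return max([t] + list(ready.values()))

def vertex(i):
    """One vertex split into stages: rotate, divide (x/z, y/z), finish."""
    rot = [Op(f"r{j}_{i}") for j in range(ROT_OPS)]
    x, y, z = (f"r{ROT_OPS - k}_{i}" for k in (3, 2, 1))
    div_x = Op(f"sx_{i}", (x, z), True)
    div_y = Op(f"sy_{i}", (y, z), True)
    add_x = Op(f"X_{i}", (f"sx_{i}",))  # e.g. add screen centre
    add_y = Op(f"Y_{i}", (f"sy_{i}",))
    return rot, div_x, div_y, add_x, add_y

r0, dx0, dy0, ax0, ay0 = vertex(0)
r1, dx1, dy1, ax1, ay1 = vertex(1)
# Straight-line code: vertex 1 starts only after vertex 0 is finished.
sequential = r0 + [dx0, dy0, ax0, ay0] + r1 + [dx1, dy1, ax1, ay1]
# Vertex 1's rotation is moved into the wait after vertex 0's first divide.
interleaved = r0 + [dx0] + r1 + [dy0, ax0, dx1, ay0, dy1, ax1, ay1]
print(cycles(sequential), cycles(interleaved))  # 86 75
```

In this model the interleaved order reaches the floor set by the single divider (ROT_OPS + 4 × 16 + 1 cycles): reordering hides the other work behind the divides, but it cannot make four serial divides cheaper.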
jguff Posted May 25, 2022
7 hours ago, 42bs said: "I still think the biggest "killer" are the two divides per projected point."
In a real 3D game, there are likely more ways to optimize this. Those two divides seem to be far outweighed by other instructions, unless I'm missing something. You might be able to define a normal for each poly/face, then for each frame:
1) rotate each poly's normal
2) analyze each poly's normal to determine if the poly will be hidden
3) rotate only the vertices that correspond to non-hidden polys
4) project only the vertices that correspond to non-hidden polys
This would probably require some reorganization of the data layout. It would probably speed up objects with a high number of polygons (e.g. kugel) and slow down objects with fewer polygons (e.g. cube).
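Steps 1) to 4) can be sketched as follows. This is a hedged illustration, not code from the thread: it assumes a pure rotation matrix `rot` plus a translation `trans`, the camera at the origin, outward face normals, and a per-face plane offset `d` precomputed in object space (the `Face` record is hypothetical). Rotating only the normal is enough even with perspective, because the plane offset only needs one extra dot product with the translation.

```python
from dataclasses import dataclass

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def mat_vec(m, v):
    """Multiply a 3x3 rotation matrix (tuple of rows) by a 3-vector."""
    return tuple(dot(row, v) for row in m)

@dataclass(frozen=True)
class Face:
    verts: tuple   # vertex indices
    normal: tuple  # outward normal in object space
    d: float       # plane offset: dot(normal, any vertex of the face)

def visible_faces(faces, rot, trans):
    """Steps 1) and 2). With n' = rot * n, the face plane in camera space is
    dot(n', x) = d + dot(n', trans); the camera at the origin sees the front
    side iff that value is negative (rot must be a pure rotation)."""
    vis = []
    for i, f in enumerate(faces):
        n_cam = mat_vec(rot, f.normal)  # rotate only the normal
        if f.d + dot(n_cam, trans) < 0:
            vis.append(i)
    return vis

def needed_vertices(faces, vis):
    """Steps 3) and 4): only these vertices get rotated and projected."""
    return sorted({v for i in vis for v in faces[i].verts})

# Usage: a cube with corners at +/-1, placed 5 units in front of the camera.
corners = [(1 if i & 1 else -1, 1 if i & 2 else -1, 1 if i & 4 else -1)
           for i in range(8)]

def make_face(verts, normal):
    return Face(verts, normal, dot(normal, corners[verts[0]]))

cube = [make_face((0, 1, 3, 2), (0, 0, -1)), make_face((4, 5, 7, 6), (0, 0, 1)),
        make_face((0, 2, 6, 4), (-1, 0, 0)), make_face((1, 3, 7, 5), (1, 0, 0)),
        make_face((0, 1, 5, 4), (0, -1, 0)), make_face((2, 3, 7, 6), (0, 1, 0))]
identity = ((1, 0, 0), (0, 1, 0), (0, 0, 1))
vis = visible_faces(cube, identity, (0, 0, 5))
print(vis, needed_vertices(cube, vis))  # [0] [0, 1, 2, 3]
```

For the cube only 4 of 8 vertices need the divides here, at the price of 6 normal rotations per frame, which matches jguff's guess that the gain grows with the polygon count.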
42bs (Author) Posted May 25, 2022
18 minutes ago, selgus said: "document what you are doing"
Oh, no. I am too old to learn to write comments.
selgus Posted May 25, 2022
4 minutes ago, 42bs said: "Oh, no. I am too old to learn to write comments."
Never too old to learn anything.
How I approach these types of functions is to write out one iteration, with NOPs filling the pipeline wherever an instruction has to wait before its result can be used. Then I get this working properly. Next I look at the NOPs and see how many times I can unroll the loop, filling them in with (preferably the same) operations for the next vertex, using different registers. Taking care of some housekeeping lead-in and lead-out operations, you can get optimal use of the processor cycles. Though I've never done this on the Jaguar's RISC processors, I have on other architectures. I wouldn't discount all the other stalls and focus only on the divides; the others might add up too.
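The lead-in/lead-out housekeeping of such an unrolled loop can be sketched as a stage schedule for a 2-deep software pipeline. The split into 'rot' (rotate/translate), 'div' (the x/z and y/z divides) and 'fin' (collect results) is a hypothetical simplification; in real GPU code each stage is a block of instructions, with vertex i and vertex i-1 living in different registers.

```python
def pipelined_schedule(n):
    """Stage order for n vertices: while vertex i-1's divides are in flight,
    vertex i is rotated, then i-1's results are collected and i's divides
    are issued."""
    if n == 0:
        return []
    sched = [("rot", 0), ("div", 0)]              # lead-in (prologue)
    for i in range(1, n):                         # steady state
        sched += [("rot", i), ("fin", i - 1), ("div", i)]
    sched.append(("fin", n - 1))                  # lead-out (epilogue)
    return sched

print(pipelined_schedule(2))
# [('rot', 0), ('div', 0), ('rot', 1), ('fin', 0), ('div', 1), ('fin', 1)]
```

Deeper unrolling (e.g. the 4 pixels in parallel 42bs mentions) follows the same pattern with more stages in flight and a longer prologue and epilogue.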
42bs (Author) Posted May 25, 2022
2 minutes ago, jguff said: "2) analyze each poly's normal to determine if the poly will be hidden"
Good point. I do it after projection, but should rather do it after rotation/translation. This means heavily re-ordering the code from point-oriented to face-oriented, and also caching points which have already been rotated/projected for other faces.
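A face-oriented loop with such a point cache might look like this minimal sketch. It assumes a per-vertex "frame stamp" array so the cache never has to be cleared between frames; that trick, the function name and the `transform` callback are illustrative assumptions, not 42bs's code. On the Jaguar, `stamp` and `cache` would be plain arrays in local RAM.

```python
def transform_visible(faces, points, transform, frame, stamp, cache):
    """Walk the faces; transform each vertex at most once per frame.

    stamp[v] holds the frame number in which cache[v] was last filled.
    Returns the transformed polygons and the number of transform() calls.
    """
    calls = 0
    polys = []
    for face in faces:
        poly = []
        for v in face:
            if stamp[v] != frame:          # first use of v in this frame
                cache[v] = transform(points[v])
                stamp[v] = frame
                calls += 1
            poly.append(cache[v])          # shared vertices reuse the result
        polys.append(poly)
    return polys, calls

# Usage: three visible faces of a cube meeting at vertex 0.
faces = [(0, 1, 3, 2), (0, 2, 6, 4), (0, 1, 5, 4)]
points = list(range(8))                    # placeholder point data
stamp, cache = [-1] * 8, [None] * 8
polys, calls = transform_visible(faces, points, lambda p: p * 10, 0, stamp, cache)
print(calls)  # 7 (12 vertex uses, 7 distinct vertices)
```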
42bs (Author) Posted May 25, 2022
2 minutes ago, selgus said: "with NOPs filling in the pipeline"
Yes, I often did so. And then it hurts if you still have some NOPs left and no idea how to replace them. For an intro (not yet released) I decided to calculate 4 pixels in parallel to keep the pipeline filled as much as possible. But sometimes the Jaguar bites back, and what seems to be a valid interleaving makes things worse.
jguff Posted May 25, 2022
I'm worst-case numberwanging 6000 cycles: 2 divs per point, 150 points, 20 cycles per div. Is my numberwang wrong?
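Written out, jguff's estimate is below. The 20 cycles per divide is his worst-case figure; with the 16 cycles per divide that 42bs gives earlier in the thread, the total drops accordingly. Neither number includes stalls or the rest of the per-point work.

```latex
\underbrace{2}_{\text{divides/point}} \times \underbrace{150}_{\text{points}} \times \underbrace{20}_{\text{cycles/divide}} = 6000\ \text{cycles},
\qquad
2 \times 150 \times 16 = 4800\ \text{cycles}
```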
42bs (Author) Posted May 27, 2022
No, I think it is correct. But in a large scene, one likely has more than just 150 points.
Positron5 Posted May 27, 2022
As an experiment I tried to compile poly_mmu using the makefile with lyxass2022 (and also rmac), and I got a load of errors. One such warning was that the file poly_mmu.equ is missing.
42bs (Author) Posted May 28, 2022
Just tried the current version on GitHub together with the latest lyxass, rmac and rln: no problem. The .equ file is generated when assembling the .js file.
Positron5 Posted May 28, 2022
35 minutes ago, 42bs said: "Just tried the current version on GitHub together with the latest lyxass, rmac and rln: no problem. The .equ file is generated when assembling the .js file."
That's good; it is probably a problem with my toolchain.
42bs (Author) Posted May 29, 2022
If you cannot get it to compile, feel free to PM me the output.
Positron5 Posted May 29, 2022
PM sent.
Positron5 Posted June 5, 2022
I would like to report on my attempts to compile @42BS's poly_mmu_68k source from: https://github.com/42Bastian/new_bjl/tree/main/exp/poly_mmu
With further advice from Bastian: his code needs his modified rmac (which provides option -4) and rln, plus his latest lyxass, which can be obtained from his GitHub repository. The variable BJL_ROOT needs to be declared (see export command). I should say that I'm a user of cygwin64 under Windows 8.1, and this causes a "can't find file or directory" error that breaks the build. If anyone wants to try this, I would be interested to hear your results.