
# new_bjl update: polygon


##### Share on other sites

That's great, this makes a valuable contribution to the subject.

I have had a look and the GPU source for poly_mmu is there; it is the file listed as poly_mmu.js.

##### Share on other sites

Yes, but not very well documented. Back then I thought that code documentation was for lamers.

##### Share on other sites

Boy, it took me longer to understand my own code after 30 years than I needed to write it in the first place.


hehe

##### Share on other sites

On 5/20/2022 at 10:55 AM, 42bs said:

Yes, but not very well documented. Back then I thought that code documentation was for lamers.

From Real Programmers Don't Eat Quiche - "Real Programmers don't comment. If it was hard to write, it should be harder to read, and even harder to modify."


Edited by Chilly Willy
formatting
##### Share on other sites

Funny (or rather, frustrating) is that the "ball" with 144 faces takes 29ms to draw. But if I disable the actual drawing (i.e. do not start the blitter), it takes 27ms.

So 94% of the time is needed for rotating and projection.

I think the projection is the main problem: 134 points means 268 divisions

##### Share on other sites

51 minutes ago, 42bs said:

Funny (or rather, frustrating) is that the "ball" with 144 faces takes 29ms to draw. But if I disable the actual drawing (i.e. do not start the blitter), it takes 27ms.

So 94% of the time is needed for rotating and projection.

I think the projection is the main problem: 134 points means 268 divisions

I assume it is not feasible to use a lookup table for the divisions?  Or could a fast fixed point routine be used?

##### Share on other sites

The division is a simple integer division, not floating point.

```
****************
* 3D->2D
*          (x'+x_pos)*dist
* x_proj = ---------------
*           z'+z_pos+dist
*
*          (y'+y_pos)*dist
* y_proj = ---------------
*           z'+z_pos+dist
****************
```
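For readers who prefer C to a comment block, here is a minimal sketch of the formula above. The type and function names (`Point3`, `Point2`, `project`) are hypothetical and not taken from the demo source; only the arithmetic comes from the post. It makes the two integer divisions per point visible.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical types: a rotated 3D point (x', y', z') and a 2D result. */
typedef struct { int32_t x, y, z; } Point3;
typedef struct { int32_t x, y; } Point2;

/* x_proj = (x' + x_pos) * dist / (z' + z_pos + dist), likewise for y. */
static Point2 project(Point3 p, Point3 pos, int32_t dist)
{
    int32_t denom = p.z + pos.z + dist;
    assert(denom > 0);                        /* point must be in front of the eye */

    Point2 out;
    out.x = ((p.x + pos.x) * dist) / denom;   /* first division  */
    out.y = ((p.y + pos.y) * dist) / denom;   /* second division */
    return out;
}
```

With 134 points this is called 134 times per frame, which is where the 268 divisions come from.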

The problem is that you need to take z_pos into account. As an experiment, I replaced the "div" with a "shrq", but then you cannot move in the Z direction.

But having a table of 1/(z'+z_pos) and then replacing the divide with a multiplication ...
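A rough C sketch of the table idea, not the demo's code: the table holds 16.16 fixed-point reciprocals indexed by the whole denominator, and the divide becomes a multiply plus a shift. `MAX_DENOM`, `RECIP_BITS` and the function names are assumptions made for illustration. Because the reciprocals are truncated, the result can be one lower than a true division.

```c
#include <assert.h>
#include <stdint.h>

#define RECIP_BITS 16      /* assumed fixed-point precision */
#define MAX_DENOM  1024    /* assumed largest denominator   */

static uint32_t recip_table[MAX_DENOM + 1];

/* Fill the table once at start-up: recip_table[d] = floor(2^16 / d). */
static void init_recip_table(void)
{
    recip_table[0] = 0;    /* unused: a zero denominator is invalid */
    for (uint32_t d = 1; d <= MAX_DENOM; d++)
        recip_table[d] = (1u << RECIP_BITS) / d;
}

/* Approximate num / denom with one multiply and one shift.
 * The sign is handled separately so that the shift is well defined. */
static int32_t div_by_table(int32_t num, uint32_t denom)
{
    assert(denom >= 1 && denom <= MAX_DENOM);
    uint32_t mag = (num < 0) ? (uint32_t)(-(int64_t)num) : (uint32_t)num;
    uint32_t q = (uint32_t)(((uint64_t)mag * recip_table[denom]) >> RECIP_BITS);
    return (num < 0) ? -(int32_t)q : (int32_t)q;
}
```

For example, `div_by_table(25600, 512)` gives 50, the same as the exact division. Whether a table lookup beats the hardware divide on the GPU would have to be measured.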

Some speed-up can be had by removing the "B_DSTEN" bit.

##### Share on other sites

5 hours ago, 42bs said:

Funny (or rather, frustrating) is that the "ball" with 144 faces takes 29ms to draw. But if I disable the actual drawing (i.e. do not start the blitter), it takes 27ms.

So 94% of the time is needed for rotating and projection.

I think the projection is the main problem: 134 points means 268 divisions

Would it be possible to split that 'division' task between the GPU and the DSP?

Maybe it would then be possible to get below 20ms per frame.

##### Share on other sites

In this demo it is rather unlikely that the DSP would be of any help, since each frame is a single stream of computation: rotate, project, draw.

Maybe it is sufficient to use the average (z'+z_pos), calculate the reciprocal once, and then do a multiplication.
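A minimal C sketch of that idea, under the assumption that one shared depth is good enough for a small object: the denominator is built from the average z' of all points, one division yields a 16.16 scale factor, and every point is then projected with multiplications only. All names here are hypothetical. The trade-off is that perspective within the object is lost, since every point gets the same scale.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { int32_t x, y, z; } Point3;   /* rotated point (x', y', z') */
typedef struct { int32_t x, y; } Point2;

/* Project n points using a single reciprocal derived from the average depth. */
static void project_avg_depth(const Point3 *pts, Point2 *out, size_t n,
                              Point3 pos, int32_t dist)
{
    if (n == 0)
        return;

    int64_t z_sum = 0;
    for (size_t i = 0; i < n; i++)
        z_sum += pts[i].z;

    int32_t denom = (int32_t)(z_sum / (int64_t)n) + pos.z + dist;
    assert(denom > 0);

    /* The only real division per frame: scale = dist / denom in 16.16. */
    int64_t scale = ((int64_t)dist << 16) / denom;

    for (size_t i = 0; i < n; i++) {
        /* Dividing by 65536 would be a shift on the real hardware. */
        out[i].x = (int32_t)(((int64_t)(pts[i].x + pos.x) * scale) / 65536);
        out[i].y = (int32_t)(((int64_t)(pts[i].y + pos.y) * scale) / 65536);
    }
}
```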

##### Share on other sites

Q1: What happens to the execution time if you blit all (or chunks of) the point data into GPU local RAM at once, instead of using loadp inside the innermost loop?
Q2: Is the ALU pipeline stalling, waiting for mmults and divs to complete?

Q3: Same as Q1, but in regards to store of projected points in innermost loop (store y1, (proj_ptr)).

Q4: Any chance of per-point page misses with the current loadp (Q1) and store (Q3) in the innermost loop? Can the loaded 3D points be on a different page than the saved projected points, resulting in a page miss?

Edited by jguff
##### Share on other sites

A2: 'div' stalls the pipeline. I could add a lot of NOPs between the div and the use of its result without seeing any change in the time needed.

But I have no idea yet what to place in this gap.

##### Share on other sites

On 5/22/2022 at 12:10 PM, 42bs said:

But if I disable the actual drawing (i.e. do not start the blitter), it takes 27ms.

So 94% of the time is needed for rotating and projection.

Q5) Does the above mean that only the line of code that starts the blitter was commented out? Or does it mean that all of the drawing code was commented out (calculation of which surfaces are hidden, line drawing, setup of the blitter for each poly, etc.)?

##### Share on other sites

Only the actual drawing. So more than 90% of the time is spent on the calculations.

##### Share on other sites

10 hours ago, jguff said:

Q4: Any chance of per-point page misses with the current loadp (Q1) and store (Q3) in the innermost loop? Can the loaded 3D points be on a different page than the saved projected points, resulting in a page miss?

Projected points are stored in GPU RAM.

I also tried loading the points via the blitter into internal RAM: no benefit. One reason: I have to wait for the blitter to clear the screen.

##### Share on other sites

11 hours ago, 42bs said:

Projected points are stored in GPU RAM.

I tried also to load the points via Blitter into internal RAM: No benefit, one reason: Have to wait for the blitter to clear the screen.

Could you blit-clear the non-active frame buffer after blitting the 3D points into GPU RAM? If the 3D points and all the code fit into RAM, you only need to transfer the 3D points from memory/cartridge once.

Actually, now that I look closer, I see the source has been updated. Looking at the bottom of the file, it looks like the 3D points and the projected points both reside in GPU RAM now.

Edited by jguff
##### Share on other sites

Q6: Is there a static code analyzer available that would provide a profile of the theoretical best-case performance?
I would be curious to see how much time is spent in rotation/projection versus the remaining code that draws (but doesn't start the blitter) in the 27ms case.

Edited by jguff
##### Share on other sites

Q7: Looks like faces (eg: faces_kugel) could be packed better.
The index offset is 133*4=532. 532 takes up 2 bytes and looks like it is stored in a 4-byte long, whereas 133 could be stored in a single byte. Use loadp or the blitter to load multiple indices into the GPU. Maybe that doesn't buy anything after doing the shifting/ANDing in the GPU. Packing main memory won't help if it doesn't help the GPU get the job done faster.
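To make the packing idea in Q7 concrete, here is a small C sketch. It assumes four vertex indices per 32-bit word and 4 bytes per point entry (the 133*4=532 from above); the function names and the slot layout are hypothetical. It also shows the shift/AND work the GPU would have to do to get the offset back.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four 8-bit vertex indices into one 32-bit word, first index on top. */
static uint32_t pack4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return ((uint32_t)a << 24) | ((uint32_t)b << 16) | ((uint32_t)c << 8) | d;
}

/* Extract index number `slot` (0..3) and turn it back into a byte offset. */
static uint32_t index_to_offset(uint32_t packed, unsigned slot)
{
    assert(slot < 4);
    uint32_t idx = (packed >> (24 - 8 * slot)) & 0xFFu;   /* shift + mask */
    return idx << 2;                                      /* idx * 4 bytes */
}
```

One load now fetches four indices instead of one, at the cost of a shift, a mask and another shift per index.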

Q8: Edge lists/tables could possibly be optimized to take up less space and less initialization time.
Iterate over vertices of polygon to find min/max y.  Only reserve memory and only initialize memory for poly_y_max-poly_y_min rows of list/table.  Would need to add field for min and max y.
It seems the code currently initializes 200 rows times 72 faces (half of the 144 faces are hidden) per frame. Clock cycles add up, even in extremely tight loops: roughly 14,000 iterations times the number of cycles per iteration. I believe there are a little less than 1,000,000 cycles available per frame at 30FPS.
This won't help if individual polys take up significant portion of y dimension of screen.
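A C sketch of the Q8 suggestion, under assumed data structures: the table layout, the sentinel values and the names `EdgeTable` and `prepare_rows` are hypothetical, and only the 200-row height comes from the discussion. It finds the polygon's min/max y and initializes just those rows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCREEN_H 200            /* rows per table, as mentioned above */

typedef struct {
    int16_t x_min[SCREEN_H];
    int16_t x_max[SCREEN_H];
    int y_min, y_max;           /* added fields: rows actually in use */
} EdgeTable;

/* Initialize only the rows covered by the polygon's vertices.
 * Returns the number of rows initialized (0 if fully off-screen). */
static int prepare_rows(EdgeTable *t, const int *ys, size_t n)
{
    assert(n > 0);
    int lo = ys[0], hi = ys[0];
    for (size_t i = 1; i < n; i++) {
        if (ys[i] < lo) lo = ys[i];
        if (ys[i] > hi) hi = ys[i];
    }
    if (lo < 0) lo = 0;                       /* clamp to the screen */
    if (hi > SCREEN_H - 1) hi = SCREEN_H - 1;
    t->y_min = lo;
    t->y_max = hi;
    if (lo > hi)
        return 0;

    for (int y = lo; y <= hi; y++) {
        t->x_min[y] = INT16_MAX;              /* "untouched" sentinels */
        t->x_max[y] = INT16_MIN;
    }
    return hi - lo + 1;
}
```

For a polygon spanning y = 40..60 this touches 21 rows instead of 200.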

Edited by jguff
##### Share on other sites

Really love seeing these kind of optimizations.  8-bit coder by hobby, but usually doing SQL, or existing code optimizations for day job.  Either way - this is way outside my normal day to day, it's good to still be excited by code, and want to learn new stuff after 15 years of the mundane.

##### Share on other sites

Q9: Is there any way to reduce the number of Bresenham passes that populate the edge lists/tables by a factor of 2?

The polygons seem to lie against one another, essentially sharing the same line on the display. It would be nice if you could reuse the Bresenham calculations of PolygonN when calculating the edges of the polygons adjacent to PolygonN.

May not be possible if you eventually calculate shade values for each entry in edge list/table.
May be easier to implement when/if you switch to span list/table.
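As a rough illustration of how much sharing there is, the following C sketch collects the unique edges of a mesh by storing each edge with its smaller vertex index first. The face layout (a flat array of vertex indices, fixed vertex count per face) and all names are assumptions, and the linear search is only meant to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint16_t a, b; } Edge;   /* always stored with a <= b */

static Edge make_edge(uint16_t v0, uint16_t v1)
{
    Edge e;
    e.a = (v0 < v1) ? v0 : v1;
    e.b = (v0 < v1) ? v1 : v0;
    return e;
}

/* Collect each distinct edge once; returns the number of unique edges. */
static size_t unique_edges(const uint16_t *faces, size_t n_faces,
                           size_t verts_per_face, Edge *out, size_t out_cap)
{
    size_t count = 0;
    for (size_t f = 0; f < n_faces; f++) {
        const uint16_t *v = faces + f * verts_per_face;
        for (size_t i = 0; i < verts_per_face; i++) {
            Edge e = make_edge(v[i], v[(i + 1) % verts_per_face]);
            size_t k = 0;
            while (k < count && !(out[k].a == e.a && out[k].b == e.b))
                k++;
            if (k == count) {             /* not seen before: store it */
                assert(count < out_cap);
                out[count++] = e;
            }
        }
    }
    return count;
}
```

Two quads that share one side have 8 edge references but only 7 unique edges; on a closed mesh every edge is shared by two faces, which is where the factor of 2 would come from.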

##### Share on other sites

Yes, there is likely some benefit from packing stuff into bytes instead of using longs. But the demo was meant as a starting point for more complex 3D stuff.

So a landscape with a lot of 3D objects might have too many points to fit into the GPU RAM. But then a span buffer might also be needed.

I doubt that the extra calculation needed to find shared edges pays off.

As for the min/max X buffer for the drawing: there is not much difference whether an object fills the whole screen or is small and moved down. The max Y is not needed, as the drawing routine exits as soon as it finds a min/max X entry that is untouched. So it is only the min Y which needs to be (or can be) optimized.

I am currently preparing stuff for Outline 2022, but from Monday on, I will certainly do some more tests on this routine.

##### Share on other sites

6 hours ago, jguff said:

Looking at the bottom of file, looks like 3d points and projected points are both now residing in GPU RAM.

I left the copy routines in, so one can play with them. I tried a blitter copy (clear screen problem) and a copy by hand. Neither brought any benefit.

I still think the biggest "killer" is the two divides per projected point. In a real 3D game, there are likely more ways to optimize this.

##### Share on other sites

4 hours ago, jguff said:

LOADP does not bring any benefit due to the HIDATA bug. That is, any STORE might destroy it. So LOADP must be followed very shortly by a load from HIDATA.

The real benefit is in tight copy or memset loops. But since it cannot be used to store to internal RAM

##### Share on other sites

Not exactly: HIDATA is trashed by any external LOAD, but it is kept across any STORE (internal and external) and any internal LOAD.

You can check my ST2JAG optimised code (ST2JAG) to see how I use it.

I have quickly read the GPU code, and there is some optimisation to be had just by reordering instructions: many cycles are lost due to register write-back conflicts.

I'm not sure how many registers are used in the secondary register bank, but maybe you can also put more movei values into it and use movefa more often.

Edited by SCPCD
