Score board bug


42bs

Just to record/warn:

The following sequence works in VJ but not on real HW:

	xor	r3,r2
	btst	#4,r2
	addqt	#4,DISTANCE
	jr	ne,noc
	addqt	#1,LIGHT
	moveq	#0,r4

It seems the Z flag from the btst is not seen, only the one from the xor.

Inserting an "or r2,r2" between the two fixes it.


Ok, it is not the Z flag.

 

Original sequence:

	xor	r3,r2
	loadb	(LIGHT),r4
	btst	#4,r2
	addqt	#4,DISTANCE
	jr	ne,noc
	addqt	#1,LIGHT
	moveq	#0,r4
noc:

Result: "jr" is _always_ taken?!

Working:

	xor	r3,r2
	loadb	(LIGHT),r4
	or	r4,r4
	btst	#4,r2
	addqt	#4,DISTANCE
	jr	ne,noc
	addqt	#1,LIGHT
	moveq	#0,r4
noc:

 

This is the "moveq" bug described in the manual. The write-back of the loadb arrives after the moveq has written #0, so r4 ends up holding the loaded value instead of 0.
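
To make the ordering problem concrete, here is a toy Python model of it. It is only a sketch and not cycle-accurate: the one-instruction-per-cycle timing, the load latency of 4 cycles and the `run` helper are all assumptions made for illustration. The only behaviour taken from the thread is that moveq does not wait for a pending load into its destination, while an instruction that reads the register (the "or r4,r4") does.

```python
def run(program, memory, load_latency=4, moveq_checks_scoreboard=False):
    """Toy model: ops issue in order, one per cycle; load data arrives late."""
    regs = {}
    pending = {}   # register -> (cycle at which the load data arrives, value)
    cycle = 0

    def retire(now):
        # Write back every load whose data has arrived by cycle `now`.
        for reg in [r for r, (ready, _) in pending.items() if ready <= now]:
            regs[reg] = pending.pop(reg)[1]

    def wait_for(reg):
        # Stall until a pending load into `reg` has been written back.
        nonlocal cycle
        if reg in pending:
            cycle = max(cycle, pending[reg][0])
            retire(cycle)

    for op, *args in program:
        retire(cycle)
        if op == "loadb":                 # ("loadb", address, dest)
            addr, dest = args
            pending[dest] = (cycle + load_latency, memory[addr])
        elif op == "moveq":               # ("moveq", value, dest)
            value, dest = args
            if moveq_checks_scoreboard:   # hypothetical "fixed" hardware
                wait_for(dest)
            regs[dest] = value
        elif op == "or":                  # ("or", src, dest) reads both registers
            src, dest = args
            wait_for(src)
            wait_for(dest)
            regs[dest] = regs.get(dest, 0) | regs.get(src, 0)
        # "nop" does nothing except take a cycle
        cycle += 1

    # Loads still in flight at the end write back last and overwrite the register.
    for reg, (_, value) in list(pending.items()):
        regs[reg] = value
    pending.clear()
    return regs


memory = {"LIGHT": 0x7F}                  # made-up content of LIGHT
load = ("loadb", "LIGHT", "r4")
clear = ("moveq", 0, "r4")

buggy = run([load, clear], memory)
fixed = run([load, ("or", "r4", "r4"), clear], memory)
print("without or r4,r4:", buggy["r4"])
print("with or r4,r4:   ", fixed["r4"])
```

In this model enough NOPs between the loadb and the moveq also hide the problem, but how many are needed depends on the assumed latency, which on real hardware is not a fixed number.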

 


Is that loadb reading from main RAM?

loadb	(LIGHT),r4

 

I wonder how big the delay is.

E.g., which of these registers will receive the new r4 (LIGHT) content?

loadb	(LIGHT),r4
move	r4,r5
move	r4,r6
move	r4,r7
move	r4,r8
move	r4,r9
move	r4,r10

Unfortunately I can't check that myself due to an issue with the "EZ-HOST" driver or the Skunk itself.

 


@Cyprian As I understand the Jaguar's bus, that is something you just can't rely on, because the bus can always be taken by a master with higher priority (the OP, to name one). On the other hand, I've read somewhere that 10 cycles is a good approximation for a read from main RAM under optimal circumstances. I don't know what this number depends on.

 

Edited by laoo

53 minutes ago, Cyprian said:

E.g, which registers will have a new r4 (LIGHT) content.

R5 already, as the read of r4 stalls until the loadb has finished.

This one is interesting:
 

loadb	(LIGHT),r4
REPT m
nop
ENDR
moveq	#0,r4
move	r4,r5

How many NOPs are needed before r5 becomes 0 and not what was read from LIGHT?

 


Quote

How many NOPs are needed before r5 becomes 0 and not what was read from LIGHT?

 

I added this:

	REPT 14
	movei	#100000,r9
	movei	#10,r8
	div	r8,r9
	ENDR

Then I see no more display errors.

 

And yes, LIGHT is in main memory.


  • 2 weeks later...
On 4/16/2022 at 10:53 PM, Cyprian said:

Wouldn't it be better to copy the data from main RAM to the GPU with the blitter in that case?

I changed my intro to use the blitter for writing line by line to DRAM, but there is no visible speed-up. The problem is that the pixels are 16 bits wide, so I need to combine two of them in GPU RAM, which again adds some cycles.

Simply put, reading 320*240*6 bytes and writing 320*240*4 bytes eats a lot of time.
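
As a rough feel for those numbers, the per-frame traffic can be worked out like this. The byte counts per pixel are the ones quoted above; the 60 frames per second is only an assumption (one update per NTSC frame) to turn the total into a rate.

```python
WIDTH, HEIGHT = 320, 240
READ_BYTES_PER_PIXEL = 6     # figure from the post
WRITE_BYTES_PER_PIXEL = 4    # figure from the post
FRAMES_PER_SECOND = 60       # assumption: one full update per NTSC frame

pixels = WIDTH * HEIGHT
read_bytes = pixels * READ_BYTES_PER_PIXEL
write_bytes = pixels * WRITE_BYTES_PER_PIXEL
total_per_frame = read_bytes + write_bytes
total_per_second = total_per_frame * FRAMES_PER_SECOND

print("read per frame: ", read_bytes, "bytes")
print("write per frame:", write_bytes, "bytes")
print("total per frame:", total_per_frame, "bytes")
print("total per second at", FRAMES_PER_SECOND, "fps:", total_per_second, "bytes")
```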
 


ok.

Does the blitter work in parallel with the GPU, or does it just stop the GPU while blitting?

 

If they can work concurrently, then maybe interleaving the code with blitting would speed it up a bit. E.g. process only half of the line (two lines, actually) in a pass, load the next half at the beginning, and save the previous half in the middle of the code.

 

Anyway, I guess the code would not fit into 256 bytes.

Edited by Cyprian

4 minutes ago, Cyprian said:

Does the blitter work in parallel with the GPU, or does it stop it during blitting?

Oh, wait. Yes, I wait for the blitter to finish.

I should try double buffering the line I write back to RAM.

I have a tiny intro which fits into 64 bytes, so plenty of space to give it a try.


10 minutes ago, Cyprian said:

cool

I checked and it is really interesting. I now wait _before_ I set up a new blit, and it actually takes more time to prepare a line of 320 pixels than to write it to memory with the blitter.

So it is (from what I can see) not possible to have a generated 320x240 picture updated every frame, and using the blitter does not (at least in my case) give any advantage.

Or I have a big bug somewhere that I do not see yet.

 


Do I understand correctly that reading/writing a whole line with the blitter isn't faster than reading/writing each pixel separately with the GPU?

 

BTW, I wonder whether the blitter transfers 64 bits or 32 bits at once. I mean when it reads from or writes to main RAM while exchanging data with the GPU RAM. It would be worth checking that with a logic analyzer.

 


The point is, I am building a line of 320 16-bit pixels in GPU RAM. Each pixel takes about 30 cycles to compute, so the stall for writing to memory does not affect the calculation.

Since the GPU can only write 32 bits and not 16 bits (Edit: to internal RAM), I need to spend another 6 cycles to combine odd and even pixels.

 

From what I understand, the Blitter can only read from GPU RAM 32 bits at a time.
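
A back-of-the-envelope estimate from the figures above (about 30 cycles per pixel, plus 6 cycles to combine a pair) can be sketched as follows. Two things are assumptions here and not from the post: that the 6 cycles are spent once per pixel pair, and that the GPU runs at roughly 26.59 MHz. Bus stalls and blitter setup are ignored.

```python
WIDTH, HEIGHT = 320, 240
CYCLES_PER_PIXEL = 30        # "about 30 cycles", from the post
CYCLES_PER_PAIR = 6          # combine odd and even pixel; assumed once per pair
GPU_CLOCK_HZ = 26_590_000    # assumption: approximate Jaguar GPU clock

cycles_per_line = WIDTH * CYCLES_PER_PIXEL + (WIDTH // 2) * CYCLES_PER_PAIR
cycles_per_frame = cycles_per_line * HEIGHT
frames_per_second = GPU_CLOCK_HZ / cycles_per_frame

print("cycles per line: ", cycles_per_line)
print("cycles per frame:", cycles_per_frame)
print("best-case frames per second: %.1f" % frames_per_second)
```

Under these assumptions the pixel computation alone stays well below a full-frame update every video frame, independent of how the lines are written back.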

Edited by 42bs

1 hour ago, 42bs said:

It is 30 cycles for computing the pixel (more or less).

Yep, I understand that, it was just a mental shortcut.

I wonder whether in your case it would be possible to run the GPU and the DSP concurrently; together they could calculate two pixels in 30 cycles.

 

 

1 hour ago, 42bs said:

Anyway, after Outline Demo Party I will release sources. And maybe I made some major things wrong and someone can point me where ;-)

great


24 minutes ago, Cyprian said:

I wonder whether in your case it would be possible to run the GPU and the DSP concurrently; together they could calculate two pixels in 30 cycles.

Probably not in the intros I am currently working on, but I did this with the Mandelbrot set.

What is really important is to halt the 68k with "stop #$2000", especially if it runs from ROM space. Otherwise it pollutes "The One Bus" (c) Mike Brent

Quote

The biggest thing to remember about the Jaguar, and you must remember this in all your theory, is that there is only one bus - the One Bus.

 


If your code does this:

loop1:
    wait for blitter not busy
    gpu prepares line
    gpu uses blitter to write line to ram
    goto loop1

 

Then it will be faster if you do this with double buffering:

loop2:
    gpu prepares 1st line
    gpu uses blitter to write 1st line to ram
    gpu prepares 2nd line
    gpu uses blitter to write 2nd line to ram
    goto loop2

You stated that the blitter always finishes before the GPU has prepared a line.


The advantage of loop2 is that the GPU never waits for the blitter to finish,
so it gets a head start on preparing the next line instead of waiting.
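
The difference between the two loops can be estimated with a small timing model. This is only a sketch: the function names are made up, the 3000-cycle blit time is a hypothetical number, and the 10560 cycles per line assume about 30 cycles per pixel plus 6 per pixel pair. loop1 is modelled as fully serial (the wait at the top comes right after the blit was started), loop2 as two line buffers in GPU RAM.

```python
def serial_time(prep_cycles, blit_cycles, lines):
    # loop1: the GPU idles during every blit before preparing the next line.
    return lines * (prep_cycles + blit_cycles)


def double_buffered_time(prep_cycles, blit_cycles, lines):
    # loop2: the GPU prepares one buffer while the blitter writes out the other.
    gpu_time = 0
    blitter_free = 0
    buffer_free = [0, 0]          # time at which each buffer may be overwritten
    for line in range(lines):
        buf = line % 2
        gpu_time = max(gpu_time, buffer_free[buf])   # buffer still being blitted?
        gpu_time += prep_cycles                      # compute the line
        gpu_time = max(gpu_time, blitter_free)       # blitter must be idle to start
        blitter_free = gpu_time + blit_cycles
        buffer_free[buf] = blitter_free
    return max(gpu_time, blitter_free)


PREP = 10560    # assumed cycles to prepare one line
BLIT = 3000     # hypothetical cycles for one line blit
LINES = 240

print("loop1 (serial):         ", serial_time(PREP, BLIT, LINES))
print("loop2 (double buffered):", double_buffered_time(PREP, BLIT, LINES))
```

As long as the blit is shorter than the preparation, the double-buffered total is the preparation time for all lines plus one trailing blit, so the saving is roughly one blit per line.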
 

