GPU good practices


matmook

I would also be interested, e.g. in how to deal with LOADP / STOREP and its "High Long Word Register".

Quote

High Long Word Register
There is no scoreboard protection for the GPU high long word register. This causes various
problems. If doing successive STOREP instructions, there is no way of telling when one has
completed so that the high data can be loaded for the next one, this has the effect that
successive STOREP instructions are really only useful when they write the same data. All
external loads will modify this register, so that an interrupt which performs external loads will
corrupt the high data from an underlying LOADP instruction, and there is no way for the
interrupt service routine to preserve this data

 

Link to comment
Share on other sites

There are some tips on my website that I used for the ST2Jag optimization: http://scpcd.free.fr/jag/jag.htm#ST2Jag

It should be about 99% accurate, from what I remember.

 

For the ST2Jag example:

- First column describes what is done in the Read cycle as R[register number]

- Second column describes what is done in the Compute cycle as C[register number], and the current task of the parallel memory controller as "M" (external memory read), "I" (internal memory read) or "R" (GPU register range)

- Third column describes what is done in the Write cycle as W[register number]

 

For external memory LOAD(B/W/P), it depends on bus usage, but a good approximation is around 10 cycles (for ST2Jag, I use 12 cycles to be pessimistic).

 

 

For the "High Long Word register", there is also an example in the ST2Jag code :)

If you want to use loadp/storep, you can't issue any other load instruction to external memory in between, as it will trash the "high long word register". In practice it is nearly impossible to use them if a GPU interrupt routine performs external loads.

But, as you can see in my code, you can insert load instructions as long as they only read internal memory.

 

 

For storep, you effectively can't do something like:

storep r0, (r1)
store  r2, (high_word_register)
nop
storep r0, (r3)

In this case, the high_word_register can be updated by the second instruction before the memory controller has latched the data and written it to the address in r1: this depends on the memory controller's current state and on bus activity.


To avoid this, you can do one of the following:

- insert enough instructions between the first and the second instruction, but it is difficult to make this reliable as it depends on bus activity:

storep r0, (r1)
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
store  r2, (high_word_register)
nop
storep r0, (r3)

- or add an external load or store instruction between the first and the second instruction: by the time execution reaches the second storep, you are assured that the first storep has completed, because the added load/store triggers a GPU wait state while the memory controller is still in its "work in progress" state.

 

storep r0, (r1)
store  r4, (somewhere_in_external_memory)
store  r2, (high_word_register)
nop
storep r0, (r3)

 

 

For load/store R14(5)+, those are useful, but at the cost of an extra (wait_state) cycle.

In the ST2Jag example, you will see that I replaced them with standard load instructions to give me more reordering possibilities and improve instruction pipelining.

But it will probably depend on register availability and on the algorithm.

 

 

Link to comment
Share on other sites

Thanks SCPCD !! :)

 

So the benefit of loadp/storep (if you can place/interleave them) is to avoid losing 10 extra cycles a second time on external memory access, compared with using 2 x load/store.
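
To put rough numbers on that, here is a minimal Python sketch of the cycle accounting. It only uses the approximations given in this thread (about 10 cycles per external access, 12 when pessimistic). The 1-cycle cost for reading the high long word register afterwards is my own assumption, not a documented figure, and the function names are purely illustrative.

```python
# Rough cycle model, not a measurement: real cost depends on bus activity.
EXTERNAL_ACCESS_CYCLES = 10      # "around 10 cycles" per external access
PESSIMISTIC_ACCESS_CYCLES = 12   # pessimistic figure used for ST2Jag
HIGH_REGISTER_READ_CYCLES = 1    # ASSUMPTION: reading the high long word register is cheap


def two_plain_loads(access_cycles):
    """Two 32-bit external loads: the external latency is paid twice."""
    return 2 * access_cycles


def one_loadp(access_cycles, high_read_cycles=HIGH_REGISTER_READ_CYCLES):
    """One 64-bit loadp, then a read of the high long word register."""
    return access_cycles + high_read_cycles


if __name__ == "__main__":
    for latency in (EXTERNAL_ACCESS_CYCLES, PESSIMISTIC_ACCESS_CYCLES):
        saved = two_plain_loads(latency) - one_loadp(latency)
        print(f"latency {latency}: loadp saves about {saved} cycles")
```

The saving only holds if nothing else performs an external load between the loadp and the read of the high long word register.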

 

Great! 

 

For R14(5)+, I don't see the benefit. I know you are right but my brain doesn't want to understand that...

 

Let's take an example:

Load (r14+1), r0

Load (r14+2), r1

 

If I want to avoid those R14(5)+, can I do:

Load (r14), r0

Nop (or some other code to not touch r14)

Addq #4, r14

Nop (or some other code to be sure r14 is updated)

Load (r14), r1

 

I will avoid extra wait_states but add new instructions, how could it be faster in that case? Oo

 

 

 

Link to comment
Share on other sites

3 minutes ago, matmook said:

I will avoid extra wait_states but add new instructions, how could it be faster in that case? Oo

For the original code... it won't be.  However, those new instructions can prepare data or registers for the next chunk of code.

Link to comment
Share on other sites

1 minute ago, CyranoJ said:

For the original code... it won't be.  However, those new instructions can prepare data or registers for the next chunk of code.

Okay for the "Nop (or some other code)" (interleaving). But I also added "addq #4, r14" and for me it means more cycles used...?

 

Link to comment
Share on other sites

Indexed and offset load/store take 2 more cycles (as wait_states) than a standard load/store.

 

Replacing them with a standard load/store plus addq can be more efficient, as you can rearrange opcodes to avoid as many wait_states as possible (at least 1 for each load/store).

The idea is to replace the 2 wait_states with 1 addq and 1 other useful instruction.
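
As a rough way to see the trade-off, here is a small Python sketch of the accounting. It assumes the best case, where each instruction issues in 1 cycle and the reordering avoids every stall. Only the figure of 2 wait states comes from this thread; the function names are illustrative.

```python
INDEXED_WAIT_STATES = 2  # extra wait states of an indexed/offset load (figure from this thread)


def indexed_load():
    """load (r14+n), rX: 1 issue cycle plus 2 wait states where nothing else runs."""
    cycles = 1 + INDEXED_WAIT_STATES
    instructions_executed = 1
    return cycles, instructions_executed


def standard_load_with_addq():
    """load (r14), rX / addq #4, r14 / one other useful instruction.

    ASSUMPTION: best case, each instruction issues in 1 cycle with no stall.
    """
    cycles = 3
    instructions_executed = 3
    return cycles, instructions_executed


if __name__ == "__main__":
    print("indexed :", indexed_load())
    print("standard:", standard_load_with_addq())
```

In this model both versions take the same number of cycles, but the second one has also advanced r14 and executed one more useful instruction in that time. The gain comes from filling cycles that would otherwise be idle, not from the load itself being faster.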

 

 

Link to comment
Share on other sites

9 minutes ago, SCPCD said:

Indexed and offset load/store take 2 more cycles (as wait_states) than a standard load/store.

 

Replacing them with a standard load/store plus addq can be more efficient, as you can rearrange opcodes to avoid as many wait_states as possible (at least 1 for each load/store).

The idea is to replace the 2 wait_states with 1 addq and 1 other useful instruction.

 

 

Okay, so if it's more efficient, it means 2 wait_states take more time than 1 addq and 1 other useful instruction (wait_state <> cycle)?
