matmook Posted January 18, 2021
Hi! I was just wondering if there is a list of GPU good practices... (SCPCD knows, I think.) I know some basic code interleaving rules, but I'm not sure about loadp/storep, load/store using r14+/r15+, ... what to avoid and what to use instead... Thanks!
swapd0 Posted January 18, 2021
Maybe... test any change on real hardware.
Cyprian Posted January 20, 2021
I would also be interested, e.g. in how to deal with LOADP / STOREP and its "High Long Word Register".
Quote:
High Long Word Register
There is no scoreboard protection for the GPU high long word register. This causes various problems. If doing successive STOREP instructions, there is no way of telling when one has completed so that the high data can be loaded for the next one, this has the effect that successive STOREP instructions are really only useful when they write the same data. All external loads will modify this register, so that an interrupt which performs external loads will corrupt the high data from an underlying LOADP instruction, and there is no way for the interrupt service routine to preserve this data.
SCPCD Posted January 20, 2021
There are some tips on my website that I used for the ST2Jag optimization, here: http://scpcd.free.fr/jag/jag.htm#ST2Jag
It should be nearly 99% true, from what I remember.
For the ST2Jag example:
- The first column describes what is done in the Read cycle, as R[register number].
- The second column describes what is done in the Compute cycle, as C[register number], and the current task of the parallel memory controller, as "M" (external memory read), "I" (internal memory read) and "R" (GPU register range).
- The third column describes what is done in the Write cycle, as W[register number].
For external memory loads (LOAD, LOADB, LOADW, LOADP), the cost depends on bus usage, but a good approximation is around 10 cycles (for the ST2Jag, I use 12 cycles to be pessimistic).
For the "High Long Word register", there is also an example in the ST2Jag code. If you want to use loadp/storep, you can't issue any other load instruction to external memory, as it will trash the "high long word register". It is effectively near impossible to use if there is an external load in a GPU interrupt routine. But, as you can see in my code, you can insert a load instruction if it only reads internal memory.
For storep, you effectively can't do something like:
    storep r0, (r1)
    store r2, (high_word_register)
    nop
    storep r0, (r3)
In this case, the high_word_register can be updated by the second instruction before the memory controller has latched the data and written it to the address in r1: this depends on the memory controller's current state and on bus activity.
To avoid this, you can do one of the following:
- Insert enough instructions between the first and the second instruction. It is difficult to make this reliable, though, as it depends on bus activity:
    storep r0, (r1)
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    store r2, (high_word_register)
    nop
    storep r0, (r3)
- Or add an external load or store instruction between the first and the second instruction. By the time execution reaches the second storep, you are assured that the first storep has completed, because the added load/store triggers a GPU wait_state while the memory controller is in its "work in progress" state:
    storep r0, (r1)
    store r4, (somewhere_in_external_memory)
    store r2, (high_word_register)
    nop
    storep r0, (r3)
For load/store R14(5)+, those are useful, but at the cost of an extra (wait_state) cycle. In the ST2Jag example, you will see that I replaced them with standard load instructions to give me more reordering possibilities and better instruction pipelining. But it will probably depend on register availability and on the algorithm.
matmook Posted January 20, 2021 (Author)
Thanks SCPCD!!
So the benefit of loadp/storep (if you can place/interleave them) is to avoid losing, a second time, 10 extra cycles to access external memory when using 2 x load/store. Great!
For R14(5)+, I don't see the benefit. I know you are right, but my brain doesn't want to understand that... Let's take an example:
    load (r14+1), r0
    load (r14+2), r1
If I want to avoid those R14(5)+, can I do:
    load (r14), r0
    nop (or some other code that doesn't touch r14)
    addq #4, r14
    nop (or some other code, to be sure r14 is updated)
    load (r14), r1
I will avoid the extra wait_states but add new instructions, so how could it be faster in that case? Oo
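To put rough numbers on the loadp point above, here is a minimal sketch in Python. It assumes SCPCD's approximation of about 10 cycles per external access (12 if pessimistic) and follows matmook's reading that one loadp pays that latency once where two 32-bit loads pay it twice. The function names are made up for illustration, and the model ignores bus contention and the cost of reading the high long word register afterwards, so treat the result as an estimate only.

```python
# Rough estimate, in GPU cycles, of fetching 64 bits from external memory.
# EXTERNAL_ACCESS_CYCLES is the approximation quoted in the thread, not a
# measured value; real timings depend on bus activity.
EXTERNAL_ACCESS_CYCLES = 10  # SCPCD uses 12 to be pessimistic


def two_loads(access_cycles: int = EXTERNAL_ACCESS_CYCLES) -> int:
    """Two 32-bit external loads: the access latency is paid twice."""
    return 2 * access_cycles


def one_loadp(access_cycles: int = EXTERNAL_ACCESS_CYCLES) -> int:
    """One 64-bit loadp: the access latency is paid once (model assumption)."""
    return access_cycles


if __name__ == "__main__":
    for cycles in (10, 12):
        saved = two_loads(cycles) - one_loadp(cycles)
        print(f"{cycles} cycles per access: about {saved} cycles saved")
```

Under these assumptions the saving equals one external access, which is only realised if no other external load (including one in an interrupt routine) disturbs the high long word register in between.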
CyranoJ Posted January 20, 2021
Quote (matmook, 3 minutes ago):
I will avoid the extra wait_states but add new instructions, so how could it be faster in that case? Oo
For the original code... it won't be. However, those new instructions can prepare data or registers for the next chunk of code.
matmook Posted January 20, 2021 (Author)
Quote (CyranoJ, 1 minute ago):
For the original code... it won't be. However, those new instructions can prepare data or registers for the next chunk of code.
Okay for the "nop (or some other code)" (interleaving). But I also added "addq #4, r14", and for me that means more cycles used...?
SCPCD Posted January 23, 2021
Indexed and offset load/store take 2 more cycles (as wait_states) than a standard load/store. Replacing them with a standard load/store plus addq can be more efficient, as you can rearrange opcodes to avoid as many wait_states as possible (at least 1 for each load/store). The idea is to replace the 2 wait_states with 1 addq and 1 other useful instruction.
matmook Posted January 23, 2021 (Author)
Quote (SCPCD, 9 minutes ago):
Indexed and offset load/store take 2 more cycles (as wait_states) than a standard load/store. Replacing them with a standard load/store plus addq can be more efficient, as you can rearrange opcodes to avoid as many wait_states as possible (at least 1 for each load/store). The idea is to replace the 2 wait_states with 1 addq and 1 other useful instruction.
Okay, so if it's more efficient, does it mean that 2 wait_states take more time than 1 addq and 1 other useful instruction (wait_state <> cycle)?
swapd0 Posted January 23, 2021 (edited)
No, it takes the same time, but you get an extra instruction executed for free, at the cost of 4 more bytes of code (for the addq and the extra instruction).
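The accounting in the last few posts can be written out as a small Python sketch. It assumes the figures given in this thread (2 wait_states per offset load, per SCPCD) plus one 16-bit word per GPU instruction, and it assumes the replacement sequence is interleaved perfectly so that no stall remains. The `Sequence` type and both helper names are hypothetical, introduced only for this illustration.

```python
from dataclasses import dataclass

INSTR_BYTES = 2              # assumes one 16-bit word per GPU instruction
OFFSET_LOAD_WAIT_STATES = 2  # figure quoted by SCPCD in this thread


@dataclass
class Sequence:
    cycles: int      # issue cycles plus wait states
    useful_ops: int  # instructions doing real work (pointer bumps excluded)
    size_bytes: int  # code size


def offset_load() -> Sequence:
    """load (r14+n),rX: one issue cycle followed by two wait states."""
    return Sequence(cycles=1 + OFFSET_LOAD_WAIT_STATES,
                    useful_ops=1,
                    size_bytes=INSTR_BYTES)


def plain_load_with_fillers() -> Sequence:
    """load (rN),rX + addq + one useful instruction, ideally interleaved."""
    return Sequence(cycles=3,     # three instructions, no stall (assumption)
                    useful_ops=2,  # the load and the extra instruction
                    size_bytes=3 * INSTR_BYTES)


if __name__ == "__main__":
    before, after = offset_load(), plain_load_with_fillers()
    print("cycles:", before.cycles, "vs", after.cycles)
    print("useful instructions:", before.useful_ops, "vs", after.useful_ops)
    print("extra code bytes:", after.size_bytes - before.size_bytes)
```

In this model both versions take the same number of cycles; the rewritten one fits one more useful instruction into that time and is 4 bytes longer, which matches swapd0's summary. Whether the interleaving is actually achievable depends on the surrounding code and free registers, so real hardware remains the reference.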
matmook Posted January 23, 2021 (Author)
Quote (swapd0, 1 hour ago):
No, it takes the same time, but you get an extra instruction executed for free, at the cost of 4 more bytes of code (for the addq and the extra instruction).
Okay, in this case that's interesting (I've got a very big loop with some of them, so...). Thanks!