+khanivore Posted January 9 Share Posted January 9 8 minutes ago, TheBF said: ; movhi-451 ; OP0 : (mem/c:HI (plus:HI (reg/f:HI 10 r10) ; (const_int 2 [0x2])) [4 %sfp+2 S2 A16])code=[mem:HI] ; OP1 : (reg:HI 2 r2)code=[reg:HI] As an outside observer, lay person, I am guessing that this is the language used to program what code GCC emits. ? (I am enjoying watching the show. Thanks) (PS don't tell me Forth is cryptic anymore) Yeah, these are RTL printouts, pretty cryptic alright. I think RTL is based on lisp. GCC matches the RTL against "predicates" in the machine description file which then emit the opcodes. 1 1 Quote Link to comment Share on other sites More sharing options...
+TheBF Posted January 9 Share Posted January 9 1 minute ago, khanivore said: Yeah, these are RTL printouts, pretty cryptic alright. I think RTL is based on lisp. GCC matches the RTL against "predicates" in the machine description file which then emit the opcodes. Interesting, LISP. Yes now I see it. Ok so all I need to do is make an RPN version of RTL! 🤣 1 1 Quote Link to comment Share on other sites More sharing options...
+khanivore Posted January 9 Share Posted January 9 18 minutes ago, Tursi said: Interestingly.. the constant changed when I changed all the local variables to ints (which I wanted to do just for cleaner code, but of course it doesn't affect the outer loop with this optimization.) Instead of >9120, it added >90E0, exactly 64 less. I don't know if that's a red herring or meaningful though Weird. It's as if it's emitting an unitialised var but I don't see where from the code. (edit: or is it using the address of shr plus an offset to be a more optimal way of finding the end of the loop? In which case resizing vars would move addresses around) Quote Link to comment Share on other sites More sharing options...
+khanivore Posted January 9 Share Posted January 9 Here is a proposed fix both the movqi and andhi3 issue. It's true that AFAWK this only affects register-to-register operations. But I'd like to test that theory with an assert if its wrong rather than quietly do the wrong thing. So I haven't included the additional constraint yet. This function does a number of checks to see if it can do a swap before hand and if not indicates to the caller that a swap must be done afterwards: /* * In the case that src operand is a reg and has an offset, then it is either a * QI that should have been truncated from a HI or a HI that should have been * extended from a QI. We try to correct this in the source operand if we can. * Return false if no correction required or if correction has been applied to * source op. Return true if correction is needed to dest operand after * operation. */ bool tms9900_correct_byte_order (rtx insn, rtx operands[]) { /* If the source operand is not a register or does not have an offset then * no action required */ if (!REG_P (operands[1]) || REG_OFFSET (operands[1]) == 0) return false; /* If the source register is not the original register, and the original is a * mem expression, then the offset refers to something else, so we can ignore * this case */ if (ORIGINAL_REGNO (operands[1]) != REGNO (operands[1]) && REG_EXPR (operands[1])) return false; /* We have determined that the byte order in operands[1] is wrong. If * the operands[1] register dies in this insn or if the target operands[0] * has the same register number, then we can emit swpb for the source */ if (REGNO (operands[1]) == REGNO (operands[0]) || find_regno_note (insn, REG_DEAD, REGNO (operands[1]))) { output_asm_insn ("swpb %1 ; subreg offset correction", operands); return false; } /* We need to swap after the operation but the destination is not a register. We * don't know how to handle this, so just asser */ if (!REG_P (operands[0])) gcc_unreachable(); /* Correction is required but cannot be applied to source. Return true so * caller can apply to dest */ return true; } Then this can be called in the md file : @@ -290,16 +293,12 @@ } else { - /* If src is a reg and has an offset, then it has been downgraded from a - * HI, so move the entire 16-bit word and do a byte swap */ + bool postByteSwap = tms9900_correct_byte_order (insn, operands); - if (REG_P (operands[1]) && REG_OFFSET (operands[1]) == 1) - { - output_asm_insn ("mov %1, %0", operands); + output_asm_insn ("movb %1, %0", operands); + + if (postByteSwap) output_asm_insn ("swpb %0", operands); - } - else - output_asm_insn ("movb %1, %0", operands); return ""; } and here: @@ -1318,13 +1378,7 @@ * this issue so far. */ - // If op[1] has an offset of -1 then it came from a paradoxical subreg and - // the combiner has eliminated an extend insn and the byte is in the wrong - // place. Emit a swpb to fix it - - if (REG_OFFSET (operands[1]) == -1) - output_asm_insn("swpb %0 ; subreg seen", operands); - + tms9900_correct_byte_order (insn, operands); if(which_alternative >= 4) { And can be added in other places as well as needed, or everywhere to be paranoid. Quote Link to comment Share on other sites More sharing options...
Tursi Posted January 10 Share Posted January 10 9 hours ago, khanivore said: Weird. It's as if it's emitting an unitialised var but I don't see where from the code. (edit: or is it using the address of shr plus an offset to be a more optimal way of finding the end of the loop? In which case resizing vars would move addresses around) I'm pretty sure that's what it wants to be doing - I have seen other loops counting up that way and the code would work that way if the constant was correct. But those other ones usually initialized a register with the end address at the top and worked correctly. I'll give this patch a go and see if it works for the other issue, but if it crashes I might not feel like tracking it down again, this one feels solved for the most part. Quote Link to comment Share on other sites More sharing options...
Tursi Posted January 10 Share Posted January 10 (edited) It was not good It took me longer than it should have to see why, I was totally looking at the wrong area. But it looks like most of the byte movs were lost. Ultimately, the easiest one to see was in vdpmemcpy, but I was seeing them elsewhere. void vdpmemcpy(int pAddr, const unsigned char *pSrc, int cnt) { VDP_SET_ADDRESS_WRITE(pAddr); while (cnt--) { VDPWD=*(pSrc++); } } pseg even def vdpmemcpy vdpmemcpy * Begin inline assembler code * 89 "../vdp.h" 1 swpb r1 set VDP address movb r1,@>8c02 swpb r1 ori r1,>4000 movb r1,@>8c02 andi r1,>3fff * 0 "" 2 * End of inline assembler code mov r3, r0 copy count to r0 jeq L4 if it's zero, early out clr r1 clear index (no longer VDP address) L3 mov r2, r4 get source address a r1, r4 add index mov *r4, @>8C00 move calculated word from RAM to VDP <<<--- bug - must be movb inc r1 increment index c r1, r3 check for end of loop jne L3 if not there, loop around L4 b *r11 .size vdpmemcpy, .-vdpmemcpy Since you can not mov 16-bit from an odd address, only even, every byte in the copy ends up duplicated. Unless I missed a part! But pretty sure I applied the patch as shown. Edited January 10 by Tursi Quote Link to comment Share on other sites More sharing options...
+khanivore Posted January 10 Share Posted January 10 15 hours ago, Tursi said: I'll give this patch a go and see if it works for the other issue, but if it crashes I might not feel like tracking it down again, this one feels solved for the most part. Fully understood and thanks for testing these out. I'll keep testing here as well. If I can't get the generic solution to work I'll go back to the point solution with the extra constraint. 14 hours ago, Tursi said: Since you can not mov 16-bit from an odd address, only even, every byte in the copy ends up duplicated. Unless I missed a part! But pretty sure I applied the patch as shown. It shows up as MOVB in my output but I may have botched the patch. I made a few other changes too and was trying to just isolate that one but that may have messed up line numbering. Quote Link to comment Share on other sites More sharing options...
Tursi Posted January 12 Share Posted January 12 I did a little more work on my random constant -- I added debug to say which line was emitting it, and it's the one in sub_const_hi. Can't make much sense of it beyond that. shr is 9 ints at >2284 in this test, and the value added was >90E0. Flipping it negative, adding or subtracting, none of the numbers mean much. Where are loop optimizations calculated? ; movhi-449 ; OP0 : (mem/c:HI (plus:HI (reg/f:HI 10 r10) ; (const_int 2 [0x2])) [4 %sfp+2 S2 A16])code=[mem:HI] ; OP1 : (reg:HI 2 r2)code=[reg:HI] ; movhi alt=2 mov r2, @>2(r10) ai r4, >90E0 ; subconsthi jeq JMP_11 b @L49 JMP_11 Not much of a clue but I wanted to save it somewhere. Quote Link to comment Share on other sites More sharing options...
+khanivore Posted January 12 Share Posted January 12 6 hours ago, Tursi said: I did a little more work on my random constant -- I added debug to say which line was emitting it, and it's the one in sub_const_hi. Can't make much sense of it beyond that. shr is 9 ints at >2284 in this test, and the value added was >90E0. Flipping it negative, adding or subtracting, none of the numbers mean much. Where are loop optimizations calculated? I'm afraid I have no clue. Deep in the bowels of GCC somewhere. Though the value is probably based on bad data we are somehow passing. Is the full code for this online anywhere? I tried to build the snippet above but it didn't create anything meaningful. Quote Link to comment Share on other sites More sharing options...
JasonACT Posted January 13 Share Posted January 13 (edited) 14 hours ago, Tursi said: I did a little more work on my random constant -- I added debug to say which line was emitting it, and it's the one in sub_const_hi. 7 hours ago, khanivore said: Is the full code for this online anywhere? I tried to build the snippet above but it didn't create anything meaningful. I tried to recreate it, with a couple of versions of GCC, but mine comes back with "AI R1,FFE7" at that point, and only if I use "-Os" - any other optimisation level produces completely different code. This seems like a reasonable value to me, and we will probably need to see the actual code to go any further. Spoiler #define uint8 unsigned char #define NUM_SHOTS 9 #define PLAYER_SHOT 16 #define ENEMY_ENGINE 32 #define ENEMY_SPRITE 8 #define ENEMY_NONE 1 #define VDP_REG_CT 0x03C0 #define PAL_BOSS 2 int f18a; int f18WhitePalette; int pwrlvl; int shr [NUM_SHOTS+1]; int shc [NUM_SHOTS+1]; int ep [4]; int enr [4]; int ent [4]; int damage [8]; int bosscnt; int checkdamage (int, int, int); void playsfx_hitboss (); void spdel (int); void spposn (int, int, int); int abs (int); void loadpal_f18a(int, int, int); void VDP_SET_REGISTER (int, int); void playsfx_explosion (); void addscore (int); void playsfx_armor (); void whoded() { unsigned char rd,cd; unsigned char r,c; /*check boss specific collisions*/ for (uint8 a=0; a<NUM_SHOTS; a++) { // check for valid shot if (!shr[a]) continue; // note: in the test case, ALL VALUES are 0 and continue always executes. // check if hit a piece of boss if (checkdamage(shr[a], shc[a], pwrlvl&0x0f)) { playsfx_hitboss(); spdel(a+PLAYER_SHOT); shr[a]=0; continue; } // check if hit an engine for (uint8 b=0; b<3; ++b) { if (ent[b] != ENEMY_ENGINE) continue; spposn(b+ENEMY_SPRITE, r, c); rd = abs(r-shr[a]); if (rd <= 20) { cd = abs(c-shc[a]); if (cd <= 15) { if (f18a) { loadpal_f18a(f18WhitePalette, PAL_BOSS*4, 16); // set boss palette to white } else { VDP_SET_REGISTER(VDP_REG_CT, 0x0f); // CT at >03C0 (makes boss flash white, everything else already is) } bosscnt=3; // how many cycles to stay white (should be 3 frames per cycle) ep[b]-=damage[pwrlvl&0x07]; if (ep[b]<=0) { playsfx_explosion(); addscore(5); ep[b]=0; enr[b]=192; ent[b]=ENEMY_NONE; spdel(b+ENEMY_SPRITE); } else { playsfx_armor(); } spdel(a+PLAYER_SHOT); shr[a]=0; break; } } } } } Edited January 13 by JasonACT Quote Link to comment Share on other sites More sharing options...
Tursi Posted January 13 Share Posted January 13 5 hours ago, JasonACT said: I tried to recreate it, with a couple of versions of GCC, but mine comes back with "AI R1,FFE7" at that point, and only if I use "-Os" - any other optimisation level produces completely different code. This seems like a reasonable value to me, and we will probably need to see the actual code to go any further. Reveal hidden contents #define uint8 unsigned char #define NUM_SHOTS 9 #define PLAYER_SHOT 16 #define ENEMY_ENGINE 32 #define ENEMY_SPRITE 8 #define ENEMY_NONE 1 #define VDP_REG_CT 0x03C0 #define PAL_BOSS 2 int f18a; int f18WhitePalette; int pwrlvl; int shr [NUM_SHOTS+1]; int shc [NUM_SHOTS+1]; int ep [4]; int enr [4]; int ent [4]; int damage [8]; int bosscnt; int checkdamage (int, int, int); void playsfx_hitboss (); void spdel (int); void spposn (int, int, int); int abs (int); void loadpal_f18a(int, int, int); void VDP_SET_REGISTER (int, int); void playsfx_explosion (); void addscore (int); void playsfx_armor (); void whoded() { unsigned char rd,cd; unsigned char r,c; /*check boss specific collisions*/ for (uint8 a=0; a<NUM_SHOTS; a++) { // check for valid shot if (!shr[a]) continue; // note: in the test case, ALL VALUES are 0 and continue always executes. // check if hit a piece of boss if (checkdamage(shr[a], shc[a], pwrlvl&0x0f)) { playsfx_hitboss(); spdel(a+PLAYER_SHOT); shr[a]=0; continue; } // check if hit an engine for (uint8 b=0; b<3; ++b) { if (ent[b] != ENEMY_ENGINE) continue; spposn(b+ENEMY_SPRITE, r, c); rd = abs(r-shr[a]); if (rd <= 20) { cd = abs(c-shc[a]); if (cd <= 15) { if (f18a) { loadpal_f18a(f18WhitePalette, PAL_BOSS*4, 16); // set boss palette to white } else { VDP_SET_REGISTER(VDP_REG_CT, 0x0f); // CT at >03C0 (makes boss flash white, everything else already is) } bosscnt=3; // how many cycles to stay white (should be 3 frames per cycle) ep[b]-=damage[pwrlvl&0x07]; if (ep[b]<=0) { playsfx_explosion(); addscore(5); ep[b]=0; enr[b]=192; ent[b]=ENEMY_NONE; spdel(b+ENEMY_SPRITE); } else { playsfx_armor(); } spdel(a+PLAYER_SHOT); shr[a]=0; break; } } } } } The correct size is 18 bytes, so not quite, but certainly much closer. The code's not out there for this one yet... Looks like I /am/ using Os, since I decided size mattered most as I was trying to squeeze 16k blocks into 8k packages. I was thinking of turning off individual loop optimizations if I could to try and isolate which one causes it. Good data though! 1 Quote Link to comment Share on other sites More sharing options...
JasonACT Posted January 13 Share Posted January 13 OK, I see it, AI R2,>9140 #define uint8 unsigned char #define NUM_SHOTS 9 int f18a; int shc [NUM_SHOTS]; int ep [3]; void spposn (int); void VDP_SET_REGISTER (); void whoded() { for (uint8 a=0; a<NUM_SHOTS; a++) { for (uint8 b=0; b<3; ++b) { spposn(b); if (shc[a] == 0) { if (f18a) { VDP_SET_REGISTER(); } else { VDP_SET_REGISTER(); } ep[b]-=1; break; } } } } 1 Quote Link to comment Share on other sites More sharing options...
JasonACT Posted January 13 Share Posted January 13 Seems *sub_const_hi has "operands[2] = GEN_INT(-INTVAL(operands[2]));" which takes 18 and produces -51080896 (F...FCF49140) - which is why I see >9140 Quote Link to comment Share on other sites More sharing options...
JasonACT Posted January 13 Share Posted January 13 Well, it's a constant, but not a constant-int - it's an offset from "shc" and it is being correctly generated, no not a GCC bug. (define_insn "*sub_const_hi" [(set (match_operand:HI 0 "register_operand" "=r,r") (plus:HI (match_operand:HI 1 "register_operand" "0,0") (neg:HI (match_operand:HI 2 "immediate_operand" "LMNO, i"))))] "" { tms9900_debug_operands ("sub_const_hi", operands, 3); if (GET_CODE (operands[2]) == CONST_INT) { operands[2] = GEN_INT(-INTVAL(operands[2])); switch(INTVAL(operands[2])) { case -2: output_asm_insn("dect %0", operands); break; case -1: output_asm_insn("dec %0", operands); break; case 0: break; case 1: output_asm_insn("inc %0", operands); break; case 2: output_asm_insn("inct %0", operands); break; default: output_asm_insn("ai %0, %2", operands); break; } } else { output_asm_insn("li r0, %2", operands); output_asm_insn("s r0, %0", operands); } return(""); } [(set_attr "length" "2,4")]) I've added case 0, which seemed to be missing, along with a work around using R0 for the variable offset. I've not tested it, but it does assemble (at least). Not sure how to make that length:4 a 6 here Quote Link to comment Share on other sites More sharing options...
+khanivore Posted January 13 Share Posted January 13 3 hours ago, JasonACT said: Well, it's a constant, but not a constant-int - it's an offset from "shc" and it is being correctly generated, no not a GCC bug. Oh wow, good catch! 3 hours ago, JasonACT said: I've added case 0, which seemed to be missing, along with a work around using R0 for the variable offset. I've not tested it, but it does assemble (at least). Not sure how to make that length:4 a 6 here Looks like constraint 'i' is "An immediate integer operand (one with constant value) is allowed. This includes symbolic constants whose values will be known only at assembly time or later." whereas 'n' is "An immediate integer operand with a known numeric value is allowed." so we could add an extra alternative to differentiate between labels and numbers: "LMN, n, i " The case for 0 shouldn't be needed but we can add a constraint O to catch this as well. That should fix up the lengths. Quote Link to comment Share on other sites More sharing options...
+khanivore Posted January 13 Share Posted January 13 8 hours ago, Tursi said: Looks like I /am/ using Os That showed up the bug for me as well - thanks. Didn't show up with -O2. Quote Link to comment Share on other sites More sharing options...
JasonACT Posted January 13 Share Posted January 13 7 hours ago, khanivore said: The case for 0 shouldn't be needed but we can add a constraint O to catch this as well. That should fix up the lengths. I was wondering, is there a case where it'll subtract 0 for the implicit compare to 0? Quote Link to comment Share on other sites More sharing options...
+khanivore Posted January 13 Share Posted January 13 37 minutes ago, JasonACT said: I was wondering, is there a case where it'll subtract 0 for the implicit compare to 0? Not that I can think of. GCC will explicitly ask for a compare insn unless we tell it an insn includes an implicit compare (which we can only do in gcc13). 2 Quote Link to comment Share on other sites More sharing options...
JasonACT Posted January 14 Share Posted January 14 2 hours ago, khanivore said: which we can only do in gcc13 Speaking of gcc13, I had to remove your "builtin_define_std ("cpu=tms9900");" (from tms9900.h) because for some reason my build produces a warning for every compile I do that "__cpu" from <built-in> is redefined, and the location of the previous define is <built-in>. Also, I noticed while looking to fix that, my install directory is 900MB so I won't be able to distribute my build of '13, it's too big. Finally, for reasons I don't understand, my build of '13 is linking to .dlls that are part of the Windows gcc suite I'm using - that doesn't happen with gcc4 - and is what I was relying on to make a standalone Windows tms9900-gcc. It does produce much better looking code though. 1 Quote Link to comment Share on other sites More sharing options...
Tursi Posted January 14 Share Posted January 14 (edited) Wow, @JasonACT, that's awesome. I don't fully understand the fix, but fix it did. With your patch, that code looks like this on my side, which makes sense to me: ; sub_const_hi-452 ; OP0 : (reg:HI 4 r4)code=[reg:HI] ; OP1 : (reg:HI 4 r4)code=[reg:HI] ; OP2 : (const:HI (plus:HI (symbol_ref:HI ("shr") [flags 0x40] <var_decl 0x7f2a8b211dc0 shr>) ; (const_int 18 [0x12])))code=[const:HI] li r0, shr+18 s r0, r4 jeq JMP_11 b @L49 JMP_11 If it's an immediate we could optimize it to AI, but I'm pretty thrilled that it works. I'll run through everything else I can and see if there's anything else breaking in my code. (EDIT: Ah, I see you did comment on that already... guess we would need to know how to detect that "shr+18" assembles to a constant. It's not awful though!) (edit2: added the RTL) Edited January 14 by Tursi 3 Quote Link to comment Share on other sites More sharing options...
Tursi Posted January 14 Share Posted January 14 I was able to run both sample apps and play through the entire game successfully. There are still a couple of weird glitches to check out, but as with all of them they are equally likely my fault until I figure out what they are. (And they are going to be tricky... one enemy sometimes likes to cling to the top of the screen and some of the enemy bullets are way too fast). But I'm tickled with the progress, this is really nice to see. I have more confidence in this compiler by the day! 5 Quote Link to comment Share on other sites More sharing options...
Tursi Posted January 14 Share Posted January 14 (edited) let me fix the obvious stuff on my side before reposting this Edited January 14 by Tursi Quote Link to comment Share on other sites More sharing options...
Tursi Posted January 14 Share Posted January 14 All issues seem resolved for now! My two bugs - one was as programmed, and the other one was promotion to unsigned int making it ignore the need to call ABS (also my fault). The second one took me a while because I thought it was ignoring a comparison, but it was like this: if (z>8) { if (z>2) { option 1 } else { option 2 } } it took me far longer to figure out that it wasn't doing the second comparison or emitting option 2 because it was smart enough to see it could never happen A little more debug to do, but I can finally progress with new features in the game. Thanks! 4 Quote Link to comment Share on other sites More sharing options...
+khanivore Posted January 14 Share Posted January 14 8 hours ago, JasonACT said: Speaking of gcc13, I had to remove your "builtin_define_std ("cpu=tms9900");" (from tms9900.h) because for some reason my build produces a warning for every compile I do that "__cpu" from <built-in> is redefined, and the location of the previous define is <built-in>. Also, I noticed while looking to fix that, my install directory is 900MB so I won't be able to distribute my build of '13, it's too big. Finally, for reasons I don't understand, my build of '13 is linking to .dlls that are part of the Windows gcc suite I'm using - that doesn't happen with gcc4 - and is what I was relying on to make a standalone Windows tms9900-gcc. It does produce much better looking code though. Yeah I noticed that. I was ignoring it for now since it was just a warning. There is still quite a lot of work left in gcc13 including updating libgcc and bintools. I did a trial build of a .deb and it wasn't as big as 900MB but was well over twice the size of the gcc4 package from what I remember. 1 Quote Link to comment Share on other sites More sharing options...
+khanivore Posted January 17 Share Posted January 17 On 1/14/2024 at 2:33 AM, Tursi said: ; sub_const_hi-452 ; OP0 : (reg:HI 4 r4)code=[reg:HI] ; OP1 : (reg:HI 4 r4)code=[reg:HI] ; OP2 : (const:HI (plus:HI (symbol_ref:HI ("shr") [flags 0x40] <var_decl 0x7f2a8b211dc0 shr>) ; (const_int 18 [0x12])))code=[const:HI] li r0, shr+18 s r0, r4 jeq JMP_11 b @L49 JMP_11 If it's an immediate we could optimize it to AI, but I'm pretty thrilled that it works. I'll run through everything else I can and see if there's anything else breaking in my code. (EDIT: Ah, I see you did comment on that already... guess we would need to know how to detect that "shr+18" assembles to a constant. It's not awful though!) (edit2: added the RTL) Returning to this, hoping to get a new release out today or tomorrow. It seems by looking through the RTL dumps (-da command line switch) that the sub_const_hi insn was added by the peephole pass to optimise a comparison when the compare value is +/- 1 or 2. Unfortunately, it has replaced a CI with LI and S which is less efficient. If I just delete that peephole and the sub_const_hi insn then it emits CI r2,sch+18 which is much better. 1 Quote Link to comment Share on other sites More sharing options...
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.