+khanivore Posted January 17 Share Posted January 17 On 12/24/2023 at 9:42 PM, Tursi said: aed4: sb r2,r1 subtract r2 from r1 again neg r1 but negate (not invert!) the result so the sign is right <--- BUG because the LSB is undefined Regarding this one, I tested removing the alternate constraints and instead of this: sb r2, r1 neg r1 the compiler emitted this: sb r1, r2 which seems much better to me, so I'm going to just delete the alternate set of constraints that support swapped operands. 2 Quote Link to comment Share on other sites More sharing options...
Tursi Posted January 18 Share Posted January 18 18 hours ago, khanivore said: Regarding this one, I tested removing the alternate constraints and instead of this: sb r2, r1 neg r1 the compiler emitted this: sb r1, r2 which seems much better to me, so I'm going to just delete the alternate set of constraints that support swapped operands. Nice... I hope it all works. I don't have my test cases nicely isolated anymore so if it breaks investigating will start from scratch again. Quote Link to comment Share on other sites More sharing options...
JasonACT Posted January 20 Share Posted January 20 On 1/18/2024 at 12:05 AM, khanivore said: Returning to this, hoping to get a new release out today or tomorrow. It seems by looking through the RTL dumps (-da command line switch) that the sub_const_hi insn was added by the peephole pass to optimise a comparison when the compare value is +/- 1 or 2. Unfortunately, it has replaced a CI with LI and S which is less efficient. If I just delete that peephole and the sub_const_hi insn then it emits CI r2,sch+18 which is much better. Do I delete the "Optimizations for Comparisons" section including both X == {-2,-1,1,2} and also X != {-2,-1,1,2} (along with the insn)? Well, I tried that anyway, and you're right - it is better on at least the example we're testing... It would be nice to enhance the peephole though, if possible, to only work on those 4 integer constants, otherwise go with a CI. Do you happen to know how the peephole was adding that sub? I know they are related, I've looked at the RTL dumps, but I'm not really following how the comparison became a sub. On 1/18/2024 at 2:05 AM, khanivore said: Regarding this one, I tested removing the alternate constraints and instead of this: sb r2, r1 neg r1 the compiler emitted this: sb r1, r2 which seems much better to me, so I'm going to just delete the alternate set of constraints that support swapped operands. I'm not sure I know enough about the .md file to try this without seeing your edits. It looks really good though Quote Link to comment Share on other sites More sharing options...
+khanivore Posted January 20 Share Posted January 20 On 1/18/2024 at 9:37 AM, Tursi said: Nice... I hope it all works. I don't have my test cases nicely isolated anymore so if it breaks investigating will start from scratch again. I'm maintaining a set of tests and trying to capture each failed case as it's reported. I can't guarantee it will catch everything but hopeful will reduce the likelihood of a regression. 1 Quote Link to comment Share on other sites More sharing options...
+khanivore Posted January 20 Share Posted January 20 6 hours ago, JasonACT said: Do I delete the "Optimizations for Comparisons" section including both X == {-2,-1,1,2} and also X != {-2,-1,1,2} (along with the insn)? Yep, delete lines 2115-2175 in tms9900.md on main. 6 hours ago, JasonACT said: It would be nice to enhance the peephole though, if possible, to only work on those 4 integer constants, otherwise go with a CI Nothing comes to mind here. I don't see that CLR rX; DEC rX; C rX, rY is any better than LI Rx, -1; C Rx ? Maybe if would be useful to emit a table of constants somewhere with entries like DATA_FF so we could do CB Rx,@DATA_FF instead of using LI so much? 6 hours ago, JasonACT said: Do you happen to know how the peephole was adding that sub? I think the rewrite matches sub-const-hi. Peepholes, like insns, work on predicates, so I'm presuming that: (define_peephole2 [(set (cc0) (compare (match_operand:HI 0 "register_operand" "r") (match_operand:HI 1 "immediate_operand" "LMNO"))) (set (pc) (if_then_else (eq (cc0) (const_int 0)) (label_ref (match_operand 2 "" "")) (pc)))] "peep2_reg_dead_p(2, operands[0])" [(set (match_operand:HI 0 "register_operand" "") (plus:HI (match_dup 0) (neg:HI (match_dup 1)))) (set (cc0) (match_dup 0)) (set (pc) (if_then_else (eq (cc0) (const_int 0)) (label_ref (match_dup 2)) (pc)))] ) means if there is sequence of insns where a compare of a reg to a const LMNO (-2,+2,-1,+1,0) [and I'm not sure constraints actually work in peepholes, the constant we saw was NOT one of LMNO] is followed by an if_then_else and the reg_dead condition is met (the reg of operands[0] is not required again after this sequence) then replace those two insns with a 3 insns where the first is a set of operand[0] to be the plus of operand[0] and the negation of operand[1] followed by a set of the condition code (which emits the compare) followed by the if_then_else. After the predicates have been rewritten by the peephole, then the set (plus (neg)) matches the "sub_const_hi" insn so it gets emitted. At least that's my understanding of how it works. I'm debugging the peepholes a bit now but tbh many of them don't make a lot of sense to me and as far as I can tell from debug output none of them except the above are being hit. I'm kind of inclined to delete them all and add very specific targeted ones back in as cases arise where there is an obvious need. 1 Quote Link to comment Share on other sites More sharing options...
JasonACT Posted January 20 Share Posted January 20 27 minutes ago, khanivore said: [and I'm not sure constraints actually work in peepholes, the constant we saw was NOT one of LMNO] Yeah, that bit in particular was why I can't work out what my next move would be. 27 minutes ago, khanivore said: tbh many of them don't make a lot of sense to me and as far as I can tell from debug output none of them except the above are being hit. Both you and me. 27 minutes ago, khanivore said: I'm kind of inclined to delete them all and add very specific targeted ones back in as cases arise where there is an obvious need. I'll take a look tomorrow, they must have been written for something that looked useful at the time. I'll see what I can find. Thanks! 1 Quote Link to comment Share on other sites More sharing options...
JasonACT Posted January 20 Share Posted January 20 10 hours ago, khanivore said: I'm kind of inclined to delete them all and add very specific targeted ones back in as cases arise where there is an obvious need. 10 hours ago, JasonACT said: I'll take a look tomorrow, they must have been written for something that looked useful at the time. I'll see what I can find. volatile unsigned int j; volatile char b; // Compile with: -Os void test () { // ;; Optimization for (unsigned char)X < N if ((unsigned char)b < 2) j = 0; // ;; Optimization for (unsigned char)X >= N if ((unsigned char)b >= 2) j = 0; // ;; Optimization for (unsigned char)X <= N if ((unsigned char)b <= 2) j = 0; // ;; Optimization for (char)X > N if (b > 2) j = 0; // ;; Optimization for (char)X < N if (b < 2) j = 0; // ;; Optimization for (char)X >= N if (b >= 2) j = 0; // ;; Optimization for (char)X <= N if (b <= 2) j = 0; } I didn't check specifically that 1.19 was using these peepholes, I'd need to add a heap of debugging to that old version, but I think it is... Output from 1.19: Spoiler test movb @b, r2 ci r2, >1FF jh L2 clr @j L2 movb @b, r2 ci r2, >1FF jle L3 clr @j L3 movb @b, r2 ci r2, >2FF jh L4 clr @j L4 movb @b, r2 ci r2, >2FF jlt L5 jeq L5 clr @j L5 movb @b, r2 ci r2, >1FF jgt L6 clr @j L6 movb @b, r2 ci r2, >1FF jlt L7 jeq L7 clr @j L7 movb @b, r2 ci r2, >2FF jgt L9 clr @j L9 b *r11 Output from 1.28: Spoiler test movb @b, r1 li r0, >100 cb r1, r0 jh L2 clr r1 mov r1, @j L2 movb @b, r1 li r0, >100 cb r1, r0 jle L3 clr r1 mov r1, @j L3 movb @b, r1 li r0, >200 cb r1, r0 jh L4 clr r1 mov r1, @j L4 movb @b, r1 li r0, >200 cb r1, r0 jlt L5 jeq L5 clr r1 mov r1, @j L5 movb @b, r1 li r0, >100 cb r1, r0 jgt L6 clr r1 mov r1, @j L6 movb @b, r1 li r0, >100 cb r1, r0 jlt L7 jeq L7 clr r1 mov r1, @j L7 movb @b, r1 li r0, >200 cb r1, r0 jgt L9 clr r1 mov r1, @j L9 b *r11 But I think you're right, they are no longer being hit. Quote Link to comment Share on other sites More sharing options...
+khanivore Posted January 21 Share Posted January 21 On 1/20/2024 at 11:59 AM, JasonACT said: I'll take a look tomorrow, they must have been written for something that looked useful at the time. I'll see what I can find. Very interesting, thanks. It looks like the replacement of "clr r1; mov r1,@j" with "clr @j" was lost in a tidy up of the constraints for "movhi". In 1.19 the constraints O and M were allowed as targets for a dest type of Q or R. I can put these back in to reduce one insn. The "ci r2,>1ff" most likely was a peephole. Possibly no longer being matched since cmpqi has tighter constraints? But it is a clever optimisation and is neater than LI and CB. I was thinking we could alter the cmpqi output but I think it depends on the following jump condition too so probably a peephole is the only way to achieve it. I'll see if I can update the peephole to work with the latest cmpqi .... 1 Quote Link to comment Share on other sites More sharing options...
+khanivore Posted January 26 Share Posted January 26 Looking at the generated code, while gcc normally calls a function like this: bl @puts If you optimise using -O2 or -Os it decides to use an intermediate register for some reason: li r2, puts bl *r2 which is generally worse not better. It might be slightly better if the load was outside a loop but the compiler rarely does the load outside the loop from what I've seen. Even worse, when it runs short of registers it spills the reg to the stack resulting in this absolute mess: li r2, testColorText mov r2, @>4(r10) ... mov @>4(r10), r2 bl *r2 I've been desperately trying to find a way to stop it from storing labels to registers before doing a branch or call and I just can't. I added cost functions that made calls to registers prohibitively expensive but they made no difference. I tried a peephole which sometimes worked but it crashed when I encountered cases like above where the label had been pushed to the stack. Then I stumbled across the switch -fno-function-cse which happily stops gcc from doing the stupidity above. Even with -Os now we don't get label stores to regs. I'd recommend everyone add this to their CFLAGS in their Makefiles if using optimisation. So I'm going to consider this issue as "closed won't fix" as there is a valid workaround 🙂 6 Quote Link to comment Share on other sites More sharing options...
+TheBF Posted January 26 Share Posted January 26 2 hours ago, khanivore said: Looking at the generated code, while gcc normally calls a function like this: bl @puts If you optimise using -O2 or -Os it decides to use an intermediate register for some reason: li r2, puts bl *r2 which is generally worse not better. It might be slightly better if the load was outside a loop but the compiler rarely does the load outside the loop from what I've seen. Even worse, when it runs short of registers it spills the reg to the stack resulting in this absolute mess: li r2, testColorText mov r2, @>4(r10) ... mov @>4(r10), r2 bl *r2 I've been desperately trying to find a way to stop it from storing labels to registers before doing a branch or call and I just can't. I added cost functions that made calls to registers prohibitively expensive but they made no difference. I tried a peephole which sometimes worked but it crashed when I encountered cases like above where the label had been pushed to the stack. Then I stumbled across the switch -fno-function-cse which happily stops gcc from doing the stupidity above. Even with -Os now we don't get label stores to regs. I'd recommend everyone add this to their CFLAGS in their Makefiles if using optimisation. So I'm going to consider this issue as "closed won't fix" as there is a valid workaround 🙂 That's an amazing win. Well done. I am absolutely flabbergasted at the complexity of this monster, and the patience you guys have to keeping flipping switches until it does your bidding. It flies in the exact opposite direction of what I play with... but it gets results. I joked about writing the lisp dialect that controls this GCC process in RPN but after pondering it for while I realize it is not a crazy idea. Forth like LISP makes creating a domain specific language pretty simple. in fact boolean functions in RPN are quite neat and have far less brackets. Example so you know what I mean and then I will stop distracting the thread. HEX DEAD 00FF AND 0080 OR -1 XOR . \ dot prints the result \ logic language extensions : NAND ( n n --?) AND NOT ; : NOR ( n n --?) OR NOT ; Your head stops hurting after a short time using it and testing stuff in the REPL. >>We now rejoin our regularly scheduled program already in progress<< 1 3 Quote Link to comment Share on other sites More sharing options...
+chue Posted January 26 Share Posted January 26 8 hours ago, khanivore said: Then I stumbled across the switch -fno-function-cse which happily stops gcc from doing the stupidity above. The docs seem to say that there are "hacks" in place which sometimes get confused (emphasis mine): Quote -fno-function-cse Do not put function addresses in registers; make each instruction that calls a constant function contain the function's address explicitly. This option results in less efficient code, but some strange hacks that alter the assembler output may be confused by the optimizations performed when this option is not used. source: https://gcc.gnu.org/onlinedocs/gcc-4.4.0/gcc/Optimize-Options.html 4 Quote Link to comment Share on other sites More sharing options...
JasonACT Posted January 26 Share Posted January 26 6 minutes ago, chue said: source: https://gcc.gnu.org/onlinedocs/gcc-4.4.0/gcc/Optimize-Options.html I read, when they compile the Unix kernel for some targets, there are processes that work on the GCC assembler output doing some substitutions that need the branch instruction to contain the address. See the gcc/common.opt file where ffunction-cse is defined. @khanivore In common.opt if you add " Init(1)" to the end of "Common Report Var(flag_no_function_cse,0)" then the default will be -fno-function-cse, which can then be switched on with -ffunction-cse if the end-user chooses it. 1 Quote Link to comment Share on other sites More sharing options...
Tursi Posted January 27 Share Posted January 27 13 hours ago, khanivore said: Then I stumbled across the switch -fno-function-cse which happily stops gcc from doing the stupidity above. Even with -Os now we don't get label stores to regs. I'd recommend everyone add this to their CFLAGS in their Makefiles if using optimisation. So I'm going to consider this issue as "closed won't fix" as there is a valid workaround I'll take it! I noticed much like you, 9 times out of 10 it ends up far less efficient. 1 Quote Link to comment Share on other sites More sharing options...
Tursi Posted January 27 Share Posted January 27 (edited) I was really happy with the improvement in code size: Got cart size of 256k ## BANK ROM NAME LOAD FREE Old Gained -- ---- ------ ----------- ---- ----- ---- ------ 00 6000 000000 loader 6000 0 0 0 01 6002 002000 fixed A000 0 0 0 02 6004 004000 0000 3768 3340 428 combination of first 3 banks - loader, init, libraries, trampolines 03 6006 006000 bank1b 602E 974 974 0 graphics 04 6008 008000 bank2a 602E 2102 2008 94 player code 05 600A 00A000 bank2b 602E 1394 1372 22 player code and end sequence 06 600C 00C000 bank3a 602E 1054 1054 0 music 07 600E 00E000 bank3b 602E 648 590 58 graphics, sound test 08 6010 010000 bank4a 602E 2216 2144 72 menus, sound effect data 09 6012 012000 bank4b 602E 974 974 0 music 10 6014 014000 bank5a 602E 925 925 0 graphics 11 6016 016000 bank5b 602E 802 802 0 graphics and text data 12 6018 018000 bank6a 602E 1542 1524 18 starfield code, music 13 601A 01A000 bank6b 602E 610 610 0 music, graphics 14 601C 01C000 bank7a 602E 532 282 250 boss code 15 601E 01E000 bank7b 602E 1030 1030 0 assembly, f18a data and init 16 6020 020000 bank8a 602E 2178 2044 134 win code, boss data 17 6022 022000 bank8b 602E 704 502 202 win code 18 6024 024000 bank9a 602E 1797 1779 18 title page code, graphics 19 6026 026000 bank9b 602E 1146 1150 -4 graphics, f18a init (interesting! rare case where it helped) 20 6028 028000 bank10a 602E 1618 1618 0 graphics 21 602A 02A000 bank10b 602E 1618 1618 0 graphics 22 602C 02C000 bank11a 602E 2002 2002 0 graphics 23 602E 02E000 bank11b 602E 1701 1701 0 graphics 24 6030 030000 bank12a 602E 794 784 10 graphics, attract mode 25 6032 032000 bank12b 602E 59 59 0 graphics 26 6034 034000 bank13a 602E 779 779 0 graphics 27 6036 036000 bank13b 602E 384 388 -4 graphics, f18a init (and one more just to buck the trend!) 28 6038 038000 bank14a 602E 124 112 12 f18a graphics and loaders 29 603A 03A000 bank14b 602E 988 824 164 enemy code, win sequences 30 603C 03C000 bank15a 602E 50 50 0 f18a graphics 31 603E 03E000 bank15b 602E 184 190 -6 f18a loader, graphics (and a little more) TOTAL SAVED: 1468 bytes and countless cycles. Wrote 'ssa8.bin' The F18A load functions where the method of loading a function address into a register saved a few bytes were pretty extreme, they all were variations of this function: void initLadybugf18() { // add 32 to VDP address, 16 to sprite table address vdpmemcpy(108*8+0x0800, &F18LADYBUG[0*8], 4*4*8); // ship straight sprites layer 1 vdpmemcpy(140*8+0x0800, &F18LADYBUG[16*8], 4*4*8); // ship left sprites layer 1 vdpmemcpy(172*8+0x0800, &F18LADYBUG[32*8], 4*4*8); // ship right sprites layer 1 vdpmemcpy(108*8+0x1000, &F18LADYBUG2[0*8], 4*4*8); // ship straight sprites layer 2 vdpmemcpy(140*8+0x1000, &F18LADYBUG2[16*8], 4*4*8); // ship left sprites layer 2 vdpmemcpy(172*8+0x1000, &F18LADYBUG2[32*8], 4*4*8); // ship right sprites layer 2 vdpmemcpy(108*8+0x1800, &F18LADYBUG3[0*8], 4*4*8); // ship straight sprites layer 2 vdpmemcpy(140*8+0x1800, &F18LADYBUG3[16*8], 4*4*8); // ship left sprites layer 2 vdpmemcpy(172*8+0x1800, &F18LADYBUG3[32*8], 4*4*8); // ship right sprites layer 2 vdpmemcpy(124*8+0x0800, F18ALTLADYBUG, 4*4*8); // straight vdpmemcpy(156*8+0x0800, &F18ALTLADYBUG[4*4*8], 4*4*8); // left vdpmemcpy(188*8+0x0800, &F18ALTLADYBUG[8*4*8], 4*4*8); // right vdpmemcpy(124*8+0x1000, F18ALTLADYBUG2, 4*4*8); // straight vdpmemcpy(156*8+0x1000, &F18ALTLADYBUG2[4*4*8], 4*4*8); // left vdpmemcpy(188*8+0x1000, &F18ALTLADYBUG2[8*4*8], 4*4*8); // right vdpmemcpy(124*8+0x1800, F18ALTLADYBUG3, 4*4*8); // straight vdpmemcpy(156*8+0x1800, &F18ALTLADYBUG3[4*4*8], 4*4*8); // left vdpmemcpy(188*8+0x1800, &F18ALTLADYBUG3[8*4*8], 4*4*8); // right loadpal_f18a(F18LADYBUGPAL, PAL_PLAYSHIP*4+1, 7); } Safe to say that's a pretty unusual case and arguably isn't even the best way to write that code (loops could have been used to save code space). Unfortunately, just as I was writing the words "no issues" the game crashed. I got through libti99, vgmcomp2, and Mario Bros without issue so I hadn't yet tested Super Space Acer. I'll need to dig into what went wrong a bit later tonight... but it seems reproducible (it crashes good right when an enemy is shot, and based on the horrible sounds it's making, I suspect it's the sound effect code ) Edited January 27 by Tursi 1 Quote Link to comment Share on other sites More sharing options...
Tursi Posted January 27 Share Posted January 27 (edited) Bug landed in doAllMusic(), while counting down the register list for a sound effect. Please forgive the amount of inline code here, I didn't want to hide everything. // regular case - music plus sound effects on SID (if present) void doAllMusic() { unsigned int old = nBank; // no music in demo, but sfx are okay if (joynum != 0) { // warning: changes bank playSNMusic(SOUNDCHIP); // stock sound chip } // run sound effects at 30 hz - SID version SWITCH_IN_BANK4a; if (VDP_INT_COUNTER & 1) { if (NULL != pSfx) { // SFX data format: // number registers, [register number, register data] unsigned char regs = *(pSfx++); if (0 == regs) { pSfx = NULL; // we'll undo the block next frame } else { while (regs--) { pSfx = processSID(pSfx); } } } else { blockSfx = 0; // make sure we didn't forget to clear something } } else { if (NULL != pShoot) { // SFX data format: // number registers, [register number, register data] unsigned char regs = *(pShoot++); if (0 == regs) { pShoot = NULL; } else { while (regs--) { pShoot = processSID(pShoot); } } } } SWITCH_IN_PREV_BANK(old); } Despite the size, it boils down to two pretty much identical loops calling processSID, and both have the same bug. We are using WORD instructions to count down a BYTE, but we never cleared the least significant byte of the WORD, so it never reaches a WORD zero. def doAllMusic doAllMusic ai r10, >FFFA stack mov r9, @>4(r10) mov r11, @>2(r10) mov @nBank, *r10 save bank mov @joynum, r0 check joynum jeq L120 skip music if 0 li r1, >8400 play on SN bl @playSNMusic go do it L120 li r2, >6010 bank change mov r2, @nBank save it mov r2, @>6010 do it movb @>8379, r1 get int counter srl r1, 8 make int (unnecessary, why not AND with >0100?) andi r1, >1 test LSB (in my case it's set) jeq L121 jump if 0 mov @pSfx, r1 get SFX pointer jeq L122 skip out if zero movb *r1, r9 get count byte (post inc expected) inc r1 postinc here? why not *r1+ then? mov r1, @pSfx save result back to pSfx movb r9, r0 check for zero jne L127 skip ahead if not (must not be cause we're making noise) clr r1 (if it was), make a zero mov r1, @pSfx and zero pSfx (why not clr @psfx?) jmp L124 and head out L127 mov @pSfx, r1 get pointer into R1 bl @processSID go process one SID instruction (seems okay, see above) mov r1, @pSfx save the result back in the pointer ai r9, >FF00 count down R9 MSB jne L127 check if it reached zero <--- bug, R9 is unsigned char and we didn't zero the LSB jmp L124 else we're done L122 clr r2 (just make sure blockSfx is cleared if there was no sfx playing) movb r2, @blockSfx jmp L124 L121 mov @pShoot, r2 (other side of the if) jeq L124 movb *r2, r9 inc r2 mov r2, @pShoot movb r9, r0 jne L128 mov r1, @pShoot jmp L124 L128 mov @pShoot, r1 bl @processSID mov r1, @pShoot ai r9, >FF00 jne L128 <---- just checking, same bug on this side of the code L124 mov *r10, r1 restore bank mov r1, @nBank mov r1, *r1 inct r10 restore stack mov *r10+, r11 mov *r10+, r9 b *r11 back to caller .size doAllMusic, .-doAllMusic even So why didn't it happen without -fno-function-cse? def doAllMusic doAllMusic ai r10, >FFF8 mov r9, @>6(r10) mov r11, @>4(r10) mov @nBank, *r10 mov @joynum, r0 jeq L120 li r1, >8400 bl @playSNMusic L120 li r2, >6010 mov r2, @nBank mov r2, @>6010 movb @>8379, r1 srl r1, 8 andi r1, >1 jeq L121 mov @pSfx, r1 jeq L122 movb *r1, r9 we still get a byte without clearing inc r1 mov r1, @pSfx movb r9, r0 we still check it as a byte above the loop jne L123 clr r1 mov r1, @pSfx jmp L124 L123 li r2, processSID function in register L127 mov @pSfx, r1 mov r2, @>2(r10) bl *r2 save to stack and call it mov r1, @pSfx put return back in pSfx ai r9, >FF00 same AI >FF00 to decrement mov @>2(r10), r2 but then we restore R2 from the stack movb r9, r0 and use an explicit test for byte zero jne L127 jmp L124 L122 clr r2 movb r2, @blockSfx jmp L124 L121 mov @pShoot, r2 jeq L124 movb *r2, r9 inc r2 mov r2, @pShoot movb r9, r0 jne L125 mov r1, @pShoot jmp L124 L125 li r2, processSID L128 mov @pShoot, r1 mov r2, @>2(r10) bl *r2 mov r1, @pShoot ai r9, >FF00 ; addqi3 mov @>2(r10), r2 movb r9, r0 jne L128 L124 mov *r10, r1 mov r1, @nBank mov r1, *r1 ai r10, >4 ; add hi3 mov *r10+, r11 mov *r10+, r9 b *r11 .size doAllMusic, .-doAllMusic even So it looks like it might be fluke. I suspect what's happened is that we don't differentiate whether a condition code of Equal was for a byte or a word compare. GCC knows the AI set condition codes, but it can't know that the code was set for the entire WORD rather than just the BYTE we care about. As much as it might feel wasteful, it's pretty common even in hand-rolled assembly to clear a register before moving bytes to it. Maybe we should consider that here? The alternative is probably to stop using AI for byte math (instead using AB/SB), that way the condition codes will always be correct. Ooooor... maybe with AI on byte values we could tell the compiler that it invalidates the condition codes, so that a test is generated if it's needed? Edited January 27 by Tursi 3 Quote Link to comment Share on other sites More sharing options...
Tursi Posted January 27 Share Posted January 27 (I tried the AB approach since we have R0 right there... it didn't cost too much so the net benefit was large, and I didn't hit any other issues.) Quote Link to comment Share on other sites More sharing options...
JasonACT Posted January 27 Share Posted January 27 MCVE: #define NULL 0 unsigned char * processSID (unsigned char *); unsigned char * pSfx; // Compile with: -Os or -O2 with -fno-function-cse void doAllMusic() { unsigned char regs = *(pSfx++); if (0 == regs) { pSfx = NULL; } else { while (regs--) { pSfx = processSID(pSfx); } } } Nice find. 1 Quote Link to comment Share on other sites More sharing options...
Tursi Posted January 27 Share Posted January 27 7 hours ago, JasonACT said: MCVE: #define NULL 0 unsigned char * processSID (unsigned char *); unsigned char * pSfx; // Compile with: -Os or -O2 with -fno-function-cse void doAllMusic() { unsigned char regs = *(pSfx++); if (0 == regs) { pSfx = NULL; } else { while (regs--) { pSfx = processSID(pSfx); } } } Nice find. Thanks for that! 1 Quote Link to comment Share on other sites More sharing options...
Tursi Posted January 28 Share Posted January 28 Unfortunately another case turned up - this one is a regression over the old compiler. I have this function which checks whether two bytes in external memory are bitwise NOT to each other. The test fails because the code is sign extending and forgetting that it matters. #define UBERGROM_RD *((volatile unsigned char*)0x983c) #define UBERGROM_WD *((volatile unsigned char*)0x9C3c) #define UBERGROM_CHECK 0xf800 int checkHighScores() { // just check the configuration bits. If bytes 0 and 1 // are not inverted copies of each other, assume no ubergrom // config space is always mapped, so we can just go ahead and read GROM_SET_ADDRESS(UBERGROM_CHECK); unsigned char a = UBERGROM_RD; unsigned char b = UBERGROM_RD; if (a != (~b)) return 0; return 1; } def checkHighScores checkHighScores li r1, >F800 hard coded address in GROM * Begin inline assembler code * 41 "/mnt/d/work/libti99ALL/grom.h" 1 movb r1,@>9c02 load address (inline asm) swpb r1 movb r1,@>9c02 swpb r1 * 0 "" 2 * End of inline assembler code movb @>983C, r3 read from GROM (base 15 for Ubergrom config) movb @>983C, r2 and again for the next byte clr r1 zero return code srl r3, 8 make byte to int (>00aa) srl r2, 8 make byte to int (>00bb) inv r2 invert second word (>ffBB) c r3, r2 test equal <--- bug? - the MSB won't match because of the invert jne L4 skip if different li r1, >1 else change return code to 1 L4 b *r11 back to caller .size checkHighScores, .-checkHighScores If I make them signed chars, then the promotion sign extends the values and the inv does the right thing. But I should be able to invert and test unsigned chars as well? (-Os and -no-function-cse) This worked on the old compiler, interestingly. I had this code in gromcfg (doing essentially the same thing): unsigned char GetConfigByte() { unsigned char x,y; x = GromReadData(0xF800, 15); y = GromReadData(0xF801, 15); y = ~y; if (x != y) { x=0; } return x; } And this definitely worked. def GetConfigByte GetConfigByte ai r10, >FFFA stack setup mov r10, r0 mov r11, *r0+ mov r9, *r0+ mov r13, *r0+ li r13, GromReadData store function address li r1, >F800 get base address li r2, >F00 get grom base index as a byte bl *r13 call GromReadData to read a byte movb r1, r9 store x in r9 li r1, >F801 next byte from GROM li r2, >F00 same base bl *r13 call GromReadData inv r1 invert result cb r9, r1 byte compare <---- since we stayed working with bytes, it was fine jeq L15 if equal, skip ahead clr r9 else invalid, so zero the byte L15 movb r9, r1 mov *r10+, r11 mov *r10+, r9 mov *r10+, r13 b *r11 even Why did it promote the comparison? 1 Quote Link to comment Share on other sites More sharing options...
Tursi Posted January 28 Share Posted January 28 Something to do with having the invert in the comparison. If I change the code to look like the older code with an explicit invert, it does the right thing: int checkHighScores() { // just check the configuration bits. If bytes 0 and 1 // are not inverted copies of each other, assume no ubergrom // config space is always mapped, so we can just go ahead and read GROM_SET_ADDRESS(UBERGROM_CHECK); signed char a = UBERGROM_RD; signed char b = UBERGROM_RD; b = ~b; // <--- do it explicitly if (a != b) return 0; return 1; } def checkHighScores checkHighScores li r1, >F800 * Begin inline assembler code * 41 "/mnt/d/work/libti99ALL/grom.h" 1 movb r1,@>9c02 swpb r1 movb r1,@>9c02 swpb r1 * 0 "" 2 * End of inline assembler code movb @>983C, r3 get a movb @>983C, r2 get b clr r1 clear return inv r2 invert b cb r3, r2 <--- now it is a byte compare, code works jne L4 li r1, >1 L4 b *r11 .size checkHighScores, .-checkHighScores 1 Quote Link to comment Share on other sites More sharing options...
JasonACT Posted January 29 Share Posted January 29 14 hours ago, Tursi said: Why did it promote the comparison? ...and @khanivore this might be a bug in the version 4 compiler... volatile unsigned char a; volatile unsigned char c; int ttt() { if (a != (~c)) return 0; return 1; }; Compiles to this with the 4.3.3 ARM version of GCC that I use for an old device I write software for sometimes: .file "test.c" .text .align 2 .global ttt .type ttt, %function ttt: @ args = 0, pretend = 0, frame = 0 @ frame_needed = 0, uses_anonymous_args = 0 @ link register save eliminated. ldr r3, .L3 ldrb r2, [r3, #0] @ zero_extendqisi2 ldr r3, .L3+4 ldrb r0, [r3, #0] @ zero_extendqisi2 mvn r0, r0 cmp r2, r0 movne r0, #0 moveq r0, #1 bx lr .L4: .align 2 .L3: .word a .word c .size ttt, .-ttt .comm a,1,1 .comm c,1,1 .ident "GCC: (GNU) 4.3.3" If I'm not mistaken, it's doing the exact same thing. Those are not my comments - that's the original GCC ARM output. 3 Quote Link to comment Share on other sites More sharing options...
Tursi Posted January 29 Share Posted January 29 (edited) Interesting find! I went over to Godbolt.org to try and understand the code... and the first thing I noticed was the default compiler (x86-64 gcc trunk) returns a warning: "warning: comparison of promoted bitwise complement of an unsigned value with unsigned [-Wsign-compare]" That implies that this is known and not intended to function correctly? I have to admit that seems counter-intuitive to me, but if it's meant to be, perhaps it's not something we need to fix? I've not often had to write code that thinks about the promotion, but of course if I cast the result of the invert, that also works. if (a != (unsigned char)(~c)) return 0; So it's clearly not a regression, the compiler literally thinks that the two forms are very different in intent. ((Edit: sorry if you read this live, I wrote it while testing and ended up editting a few times )) Edited January 29 by Tursi 1 1 Quote Link to comment Share on other sites More sharing options...
JasonACT Posted January 29 Share Posted January 29 8 minutes ago, Tursi said: perhaps it's not something we need to fix? I'll be a hard one to fix, that's for sure. Probably what someone else thought when they added the warning 1 Quote Link to comment Share on other sites More sharing options...
Tursi Posted January 29 Share Posted January 29 5 minutes ago, JasonACT said: I'll be a hard one to fix, that's for sure. Probably what someone else thought when they added the warning Yeah, I think so. I can find people arguing about it all the back to 2008! https://gcc.gnu.org/legacy-ml/gcc-bugs/2008-12/msg00046.html 1 2 Quote Link to comment Share on other sites More sharing options...
JasonACT Posted January 29 Share Posted January 29 4 hours ago, Tursi said: Yeah, I think so. I can find people arguing about it all the back to 2008! https://gcc.gnu.org/legacy-ml/gcc-bugs/2008-12/msg00046.html I'm adding -Wall and -W to my compiles from now on (-W is apparently the same [but an old version of/] as -Wextra). Quote Link to comment Share on other sites More sharing options...
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.