GCC for the TI

+khanivore · January 17

On 12/24/2023 at 9:42 PM, Tursi said:

aed4:
  sb r2,r1  subtract r2 from r1 again
  neg r1    but negate (not invert!) the result so the sign is right <--- BUG because the LSB is undefined

Regarding this one, I tested removing the alternate constraints and instead of this:

  sb   r2, r1
  neg  r1

the compiler emitted this:

  sb   r1, r2

which seems much better to me, so I'm going to just delete the alternate set of constraints that support swapped operands.

Tursi · January 18

18 hours ago, khanivore said:
Regarding this one, I tested removing the alternate constraints and instead of this:
  sb   r2, r1
  neg  r1
the compiler emitted this:
  sb   r1, r2    
which seems much better to me, so I'm going to just delete the alternate set of constraints that support swapped operands.

Nice... I hope it all works. I don't have my test cases nicely isolated anymore so if it breaks investigating will start from scratch again.

JasonACT · January 20

On 1/18/2024 at 12:05 AM, khanivore said:

Returning to this, hoping to get a new release out today or tomorrow. It seems by looking through the RTL dumps (-da command line switch) that the sub_const_hi insn was added by the peephole pass to optimise a comparison when the compare value is +/- 1 or 2. Unfortunately, it has replaced a CI with LI and S which is less efficient. If I just delete that peephole and the sub_const_hi insn then it emits CI r2,sch+18 which is much better.

Do I delete the "Optimizations for Comparisons" section including both X == {-2,-1,1,2} and also X != {-2,-1,1,2} (along with the insn)? Well, I tried that anyway, and you're right - it is better on at least the example we're testing... It would be nice to enhance the peephole though, if possible, to only work on those 4 integer constants, otherwise go with a CI. Do you happen to know how the peephole was adding that sub? I know they are related, I've looked at the RTL dumps, but I'm not really following how the comparison became a sub.

On 1/18/2024 at 2:05 AM, khanivore said:
Regarding this one, I tested removing the alternate constraints and instead of this:
  sb   r2, r1
  neg  r1
the compiler emitted this:
  sb   r1, r2    
which seems much better to me, so I'm going to just delete the alternate set of constraints that support swapped operands.

I'm not sure I know enough about the .md file to try this without seeing your edits. It looks really good though

+khanivore · January 20

On 1/18/2024 at 9:37 AM, Tursi said:

Nice... I hope it all works. I don't have my test cases nicely isolated anymore so if it breaks investigating will start from scratch again.

I'm maintaining a set of tests and trying to capture each failed case as it's reported. I can't guarantee it will catch everything but hopeful will reduce the likelihood of a regression.

+khanivore · January 20

6 hours ago, JasonACT said:

Do I delete the "Optimizations for Comparisons" section including both X == {-2,-1,1,2} and also X != {-2,-1,1,2} (along with the insn)?

Yep, delete lines 2115-2175 in tms9900.md on main.

6 hours ago, JasonACT said:

It would be nice to enhance the peephole though, if possible, to only work on those 4 integer constants, otherwise go with a CI

Nothing comes to mind here. I don't see that CLR rX; DEC rX; C rX, rY is any better than LI Rx, -1; C Rx ? Maybe if would be useful to emit a table of constants somewhere with entries like DATA_FF so we could do CB Rx,@DATA_FF instead of using LI so much?

6 hours ago, JasonACT said:

Do you happen to know how the peephole was adding that sub?

I think the rewrite matches sub-const-hi. Peepholes, like insns, work on predicates, so I'm presuming that:

(define_peephole2
  [(set (cc0)
        (compare (match_operand:HI 0 "register_operand" "r")
                 (match_operand:HI 1 "immediate_operand" "LMNO")))
   (set (pc) (if_then_else (eq (cc0) (const_int 0))
                           (label_ref (match_operand 2 "" ""))
                           (pc)))]
  "peep2_reg_dead_p(2, operands[0])"
  [(set (match_operand:HI 0 "register_operand" "")
        (plus:HI (match_dup 0) (neg:HI (match_dup 1))))
   (set (cc0) (match_dup 0))
   (set (pc) (if_then_else (eq (cc0) (const_int 0))
                           (label_ref (match_dup 2))
                           (pc)))]
  )

means if there is sequence of insns where a compare of a reg to a const LMNO (-2,+2,-1,+1,0) [and I'm not sure constraints actually work in peepholes, the constant we saw was NOT one of LMNO] is followed by an if_then_else and the reg_dead condition is met (the reg of operands[0] is not required again after this sequence) then replace those two insns with a 3 insns where the first is a set of operand[0] to be the plus of operand[0] and the negation of operand[1] followed by a set of the condition code (which emits the compare) followed by the if_then_else.

After the predicates have been rewritten by the peephole, then the set (plus (neg)) matches the "sub_const_hi" insn so it gets emitted. At least that's my understanding of how it works.

I'm debugging the peepholes a bit now but tbh many of them don't make a lot of sense to me and as far as I can tell from debug output none of them except the above are being hit. I'm kind of inclined to delete them all and add very specific targeted ones back in as cases arise where there is an obvious need.

JasonACT · January 20

27 minutes ago, khanivore said:

[and I'm not sure constraints actually work in peepholes, the constant we saw was NOT one of LMNO]

Yeah, that bit in particular was why I can't work out what my next move would be.

27 minutes ago, khanivore said:

tbh many of them don't make a lot of sense to me and as far as I can tell from debug output none of them except the above are being hit.

Both you and me.

27 minutes ago, khanivore said:

I'm kind of inclined to delete them all and add very specific targeted ones back in as cases arise where there is an obvious need.

I'll take a look tomorrow, they must have been written for something that looked useful at the time. I'll see what I can find.

Thanks!

JasonACT · January 20

10 hours ago, khanivore said:

I'm kind of inclined to delete them all and add very specific targeted ones back in as cases arise where there is an obvious need.

10 hours ago, JasonACT said:

I'll take a look tomorrow, they must have been written for something that looked useful at the time. I'll see what I can find.

volatile unsigned int j;
volatile char b;

// Compile with: -Os

void test () {
    // ;; Optimization for (unsigned char)X < N
    if ((unsigned char)b < 2)
        j = 0;
    // ;; Optimization for (unsigned char)X >= N
    if ((unsigned char)b >= 2)
        j = 0;
    // ;; Optimization for (unsigned char)X <= N
    if ((unsigned char)b <= 2)
        j = 0;
    // ;; Optimization for (char)X > N
    if (b > 2)
        j = 0;
    // ;; Optimization for (char)X < N
    if (b < 2)
        j = 0;
    // ;; Optimization for (char)X >= N
    if (b >= 2)
        j = 0;
    // ;; Optimization for (char)X <= N
    if (b <= 2)
        j = 0;
}

I didn't check specifically that 1.19 was using these peepholes, I'd need to add a heap of debugging to that old version, but I think it is...

Output from 1.19:

Spoiler

test
   movb @b, r2
   ci r2, >1FF
   jh L2
   clr @j
L2
   movb @b, r2
   ci r2, >1FF
   jle L3
   clr @j
L3
   movb @b, r2
   ci r2, >2FF
   jh L4
   clr @j
L4
   movb @b, r2
   ci r2, >2FF
   jlt L5
   jeq L5
   clr @j
L5
   movb @b, r2
   ci r2, >1FF
   jgt L6
   clr @j
L6
   movb @b, r2
   ci r2, >1FF
   jlt L7
   jeq L7
   clr @j
L7
   movb @b, r2
   ci r2, >2FF
   jgt L9
   clr @j
L9
   b *r11

Output from 1.28:

Spoiler

test
   movb @b, r1
   li r0, >100
   cb r1, r0
   jh L2
   clr r1
   mov r1, @j
L2
   movb @b, r1
   li r0, >100
   cb r1, r0
   jle L3
   clr r1
   mov r1, @j
L3
   movb @b, r1
   li r0, >200
   cb r1, r0
   jh L4
   clr r1
   mov r1, @j
L4
   movb @b, r1
   li r0, >200
   cb r1, r0
   jlt L5
   jeq L5
   clr r1
   mov r1, @j
L5
   movb @b, r1
   li r0, >100
   cb r1, r0
   jgt L6
   clr r1
   mov r1, @j
L6
   movb @b, r1
   li r0, >100
   cb r1, r0
   jlt L7
   jeq L7
   clr r1
   mov r1, @j
L7
   movb @b, r1
   li r0, >200
   cb r1, r0
   jgt L9
   clr r1
   mov r1, @j
L9
   b *r11

But I think you're right, they are no longer being hit.

+khanivore · January 21

On 1/20/2024 at 11:59 AM, JasonACT said:

I'll take a look tomorrow, they must have been written for something that looked useful at the time. I'll see what I can find.

Very interesting, thanks. It looks like the replacement of "clr r1; mov r1,@j" with "clr @j" was lost in a tidy up of the constraints for "movhi". In 1.19 the constraints O and M were allowed as targets for a dest type of Q or R. I can put these back in to reduce one insn.

The "ci r2,>1ff" most likely was a peephole. Possibly no longer being matched since cmpqi has tighter constraints? But it is a clever optimisation and is neater than LI and CB. I was thinking we could alter the cmpqi output but I think it depends on the following jump condition too so probably a peephole is the only way to achieve it. I'll see if I can update the peephole to work with the latest cmpqi ....

+khanivore · January 26

Looking at the generated code, while gcc normally calls a function like this:

        bl   @puts

If you optimise using -O2 or -Os it decides to use an intermediate register for some reason:

        li   r2, puts                                                                                                 
        bl   *r2

which is generally worse not better. It might be slightly better if the load was outside a loop but the compiler rarely does the load outside the loop from what I've seen.

Even worse, when it runs short of registers it spills the reg to the stack resulting in this absolute mess:

        li   r2, testColorText                                                                                        
        mov  r2, @>4(r10)                                                                                             
        ...        
        mov  @>4(r10), r2                                                                                             
        bl   *r2

I've been desperately trying to find a way to stop it from storing labels to registers before doing a branch or call and I just can't. I added cost functions that made calls to registers prohibitively expensive but they made no difference. I tried a peephole which sometimes worked but it crashed when I encountered cases like above where the label had been pushed to the stack.

Then I stumbled across the switch -fno-function-cse which happily stops gcc from doing the stupidity above. Even with -Os now we don't get label stores to regs. I'd recommend everyone add this to their CFLAGS in their Makefiles if using optimisation. So I'm going to consider this issue as "closed won't fix" as there is a valid workaround 🙂

+TheBF · January 26

2 hours ago, khanivore said:
Looking at the generated code, while gcc normally calls a function like this:
        bl   @puts          
If you optimise using -O2 or -Os it decides to use an intermediate register for some reason:
        li   r2, puts                                                                                                 
        bl   *r2
which is generally worse not better. It might be slightly better if the load was outside a loop but the compiler rarely does the load outside the loop from what I've seen.

Even worse, when it runs short of registers it spills the reg to the stack resulting in this absolute mess:
        li   r2, testColorText                                                                                        
        mov  r2, @>4(r10)                                                                                             
        ...        
        mov  @>4(r10), r2                                                                                             
        bl   *r2
I've been desperately trying to find a way to stop it from storing labels to registers before doing a branch or call and I just can't. I added cost functions that made calls to registers prohibitively expensive but they made no difference. I tried a peephole which sometimes worked but it crashed when I encountered cases like above where the label had been pushed to the stack.

Then I stumbled across the switch -fno-function-cse which happily stops gcc from doing the stupidity above. Even with -Os now we don't get label stores to regs. I'd recommend everyone add this to their CFLAGS in their Makefiles if using optimisation. So I'm going to consider this issue as "closed won't fix" as there is a valid workaround 🙂

That's an amazing win. Well done.

I am absolutely flabbergasted at the complexity of this monster, and the patience you guys have to keeping flipping switches until it does your bidding.

It flies in the exact opposite direction of what I play with... but it gets results.

I joked about writing the lisp dialect that controls this GCC process in RPN but after pondering it for while I realize it is not a crazy idea.

Forth like LISP makes creating a domain specific language pretty simple.

in fact boolean functions in RPN are quite neat and have far less brackets.

Example so you know what I mean and then I will stop distracting the thread.

HEX

DEAD 00FF AND  0080 OR  -1 XOR .   \ dot prints the result

\ logic language extensions
: NAND ( n n --?)  AND NOT ; 
: NOR  ( n n --?)  OR NOT ;

Your head stops hurting after a short time using it and testing stuff in the REPL.

>>We now rejoin our regularly scheduled program already in progress<<

+chue · January 26

8 hours ago, khanivore said:

Then I stumbled across the switch -fno-function-cse which happily stops gcc from doing the stupidity above.

The docs seem to say that there are "hacks" in place which sometimes get confused (emphasis mine):

Quote

-fno-function-cse
Do not put function addresses in registers; make each instruction that calls a constant function contain the function's address explicitly.
This option results in less efficient code, but some strange hacks that alter the assembler output may be confused by the optimizations performed when this option is not used.

source: https://gcc.gnu.org/onlinedocs/gcc-4.4.0/gcc/Optimize-Options.html

JasonACT · January 26

6 minutes ago, chue said:

source: https://gcc.gnu.org/onlinedocs/gcc-4.4.0/gcc/Optimize-Options.html

I read, when they compile the Unix kernel for some targets, there are processes that work on the GCC assembler output doing some substitutions that need the branch instruction to contain the address. See the gcc/common.opt file where ffunction-cse is defined.

@khanivore In common.opt if you add " Init(1)" to the end of "Common Report Var(flag_no_function_cse,0)" then the default will be -fno-function-cse, which can then be switched on with -ffunction-cse if the end-user chooses it.

Tursi · January 27

13 hours ago, khanivore said:

Then I stumbled across the switch -fno-function-cse which happily stops gcc from doing the stupidity above. Even with -Os now we don't get label stores to regs. I'd recommend everyone add this to their CFLAGS in their Makefiles if using optimisation. So I'm going to consider this issue as "closed won't fix" as there is a valid workaround

I'll take it! I noticed much like you, 9 times out of 10 it ends up far less efficient.

Tursi · January 27

I was really happy with the improvement in code size:

Got cart size of 256k
##  BANK  ROM     NAME         LOAD  FREE    Old    Gained     
--  ----  ------  -----------  ----  -----  ----    ------
00  6000  000000  loader       6000      0     0       0
01  6002  002000  fixed        A000      0     0       0
02  6004  004000               0000   3768  3340     428    combination of first 3 banks - loader, init, libraries, trampolines
03  6006  006000  bank1b       602E    974   974       0    graphics
04  6008  008000  bank2a       602E   2102  2008      94    player code
05  600A  00A000  bank2b       602E   1394  1372      22    player code and end sequence
06  600C  00C000  bank3a       602E   1054  1054       0    music
07  600E  00E000  bank3b       602E    648   590      58    graphics, sound test
08  6010  010000  bank4a       602E   2216  2144      72    menus, sound effect data
09  6012  012000  bank4b       602E    974   974       0    music
10  6014  014000  bank5a       602E    925   925       0    graphics
11  6016  016000  bank5b       602E    802   802       0    graphics and text data
12  6018  018000  bank6a       602E   1542  1524      18    starfield code, music
13  601A  01A000  bank6b       602E    610   610       0    music, graphics
14  601C  01C000  bank7a       602E    532   282     250    boss code
15  601E  01E000  bank7b       602E   1030  1030       0    assembly, f18a data and init
16  6020  020000  bank8a       602E   2178  2044     134    win code, boss data
17  6022  022000  bank8b       602E    704   502     202    win code
18  6024  024000  bank9a       602E   1797  1779      18    title page code, graphics
19  6026  026000  bank9b       602E   1146  1150      -4    graphics, f18a init (interesting! rare case where it helped)
20  6028  028000  bank10a      602E   1618  1618       0    graphics
21  602A  02A000  bank10b      602E   1618  1618       0    graphics
22  602C  02C000  bank11a      602E   2002  2002       0    graphics
23  602E  02E000  bank11b      602E   1701  1701       0    graphics
24  6030  030000  bank12a      602E    794   784      10    graphics, attract mode
25  6032  032000  bank12b      602E     59    59       0    graphics
26  6034  034000  bank13a      602E    779   779       0    graphics
27  6036  036000  bank13b      602E    384   388      -4    graphics, f18a init (and one more just to buck the trend!)
28  6038  038000  bank14a      602E    124   112      12    f18a graphics and loaders
29  603A  03A000  bank14b      602E    988   824     164    enemy code, win sequences
30  603C  03C000  bank15a      602E     50    50       0    f18a graphics
31  603E  03E000  bank15b      602E    184   190      -6    f18a loader, graphics (and a little more)
                                       TOTAL SAVED: 1468 bytes and countless cycles. 

Wrote 'ssa8.bin'

The F18A load functions where the method of loading a function address into a register saved a few bytes were pretty extreme, they all were variations of this function:

void initLadybugf18() {
    // add 32 to VDP address, 16 to sprite table address
    vdpmemcpy(108*8+0x0800, &F18LADYBUG[0*8], 4*4*8);        // ship straight sprites layer 1
    vdpmemcpy(140*8+0x0800, &F18LADYBUG[16*8], 4*4*8);        // ship left sprites layer 1
    vdpmemcpy(172*8+0x0800, &F18LADYBUG[32*8], 4*4*8);        // ship right sprites layer 1
    vdpmemcpy(108*8+0x1000, &F18LADYBUG2[0*8], 4*4*8);        // ship straight sprites layer 2
    vdpmemcpy(140*8+0x1000, &F18LADYBUG2[16*8], 4*4*8);        // ship left sprites layer 2
    vdpmemcpy(172*8+0x1000, &F18LADYBUG2[32*8], 4*4*8);        // ship right sprites layer 2
    vdpmemcpy(108*8+0x1800, &F18LADYBUG3[0*8], 4*4*8);        // ship straight sprites layer 2
    vdpmemcpy(140*8+0x1800, &F18LADYBUG3[16*8], 4*4*8);        // ship left sprites layer 2
    vdpmemcpy(172*8+0x1800, &F18LADYBUG3[32*8], 4*4*8);        // ship right sprites layer 2

    vdpmemcpy(124*8+0x0800, F18ALTLADYBUG, 4*4*8);            // straight
    vdpmemcpy(156*8+0x0800, &F18ALTLADYBUG[4*4*8], 4*4*8);    // left
    vdpmemcpy(188*8+0x0800, &F18ALTLADYBUG[8*4*8], 4*4*8);    // right
    vdpmemcpy(124*8+0x1000, F18ALTLADYBUG2, 4*4*8);            // straight
    vdpmemcpy(156*8+0x1000, &F18ALTLADYBUG2[4*4*8], 4*4*8);    // left
    vdpmemcpy(188*8+0x1000, &F18ALTLADYBUG2[8*4*8], 4*4*8);    // right
    vdpmemcpy(124*8+0x1800, F18ALTLADYBUG3, 4*4*8);            // straight
    vdpmemcpy(156*8+0x1800, &F18ALTLADYBUG3[4*4*8], 4*4*8);    // left
    vdpmemcpy(188*8+0x1800, &F18ALTLADYBUG3[8*4*8], 4*4*8);    // right

    loadpal_f18a(F18LADYBUGPAL, PAL_PLAYSHIP*4+1, 7);
}

Safe to say that's a pretty unusual case and arguably isn't even the best way to write that code (loops could have been used to save code space).

Unfortunately, just as I was writing the words "no issues" the game crashed. I got through libti99, vgmcomp2, and Mario Bros without issue so I hadn't yet tested Super Space Acer. I'll need to dig into what went wrong a bit later tonight... but it seems reproducible (it crashes good right when an enemy is shot, and based on the horrible sounds it's making, I suspect it's the sound effect code )

Edited January 27 by Tursi

Tursi · January 27

Bug landed in doAllMusic(), while counting down the register list for a sound effect. Please forgive the amount of inline code here, I didn't want to hide everything.

// regular case - music plus sound effects on SID (if present)
void doAllMusic() {
    unsigned int old = nBank;

    // no music in demo, but sfx are okay
    if (joynum != 0) {
        // warning: changes bank
        playSNMusic(SOUNDCHIP);    // stock sound chip
    }
     
    // run sound effects at 30 hz - SID version
    SWITCH_IN_BANK4a;

    if (VDP_INT_COUNTER & 1) {
        if (NULL != pSfx) {
            // SFX data format:
            // number registers, [register number, register data]
            unsigned char regs = *(pSfx++);
            if (0 == regs) {
                pSfx = NULL;
                // we'll undo the block next frame
            } else {
                while (regs--) {
                    pSfx = processSID(pSfx);
                }
            }
        } else {
            blockSfx = 0;   // make sure we didn't forget to clear something
        }
    } else {
        if (NULL != pShoot) {
            // SFX data format:
            // number registers, [register number, register data]
            unsigned char regs = *(pShoot++);
            if (0 == regs) {
                pShoot = NULL;
            } else {
                while (regs--) {
                    pShoot = processSID(pShoot);
                }
            }
        }
    }

    SWITCH_IN_PREV_BANK(old);
}

Despite the size, it boils down to two pretty much identical loops calling processSID, and both have the same bug. We are using WORD instructions to count down a BYTE, but we never cleared the least significant byte of the WORD, so it never reaches a WORD zero.

    def    doAllMusic
doAllMusic
    ai   r10, >FFFA     stack
    mov  r9, @>4(r10)
    mov  r11, @>2(r10)
    mov  @nBank, *r10   save bank
    mov  @joynum, r0    check joynum
    jeq  L120           skip music if 0
    li   r1, >8400      play on SN
    bl   @playSNMusic   go do it
L120
    li   r2, >6010      bank change
    mov  r2, @nBank     save it
    mov  r2, @>6010     do it
    movb @>8379, r1     get int counter
    srl  r1, 8          make int (unnecessary, why not AND with >0100?)
    andi r1, >1         test LSB (in my case it's set)
    jeq  L121           jump if 0
    mov  @pSfx, r1      get SFX pointer
    jeq  L122           skip out if zero
    movb *r1, r9        get count byte (post inc expected)
    inc  r1             postinc here? why not *r1+ then?
    mov  r1, @pSfx      save result back to pSfx
    movb r9, r0         check for zero
    jne  L127           skip ahead if not (must not be cause we're making noise)
    clr  r1             (if it was), make a zero
    mov  r1, @pSfx      and zero pSfx (why not clr @psfx?)
    jmp  L124           and head out
L127
    mov  @pSfx, r1      get pointer into R1
    bl   @processSID    go process one SID instruction (seems okay, see above)
    mov  r1, @pSfx      save the result back in the pointer
    ai   r9, >FF00      count down R9 MSB
    jne  L127           check if it reached zero <--- bug, R9 is unsigned char and we didn't zero the LSB
    jmp  L124           else we're done
L122
    clr  r2             (just make sure blockSfx is cleared if there was no sfx playing)
    movb r2, @blockSfx
    jmp  L124
L121
    mov  @pShoot, r2    (other side of the if)
    jeq  L124
    movb *r2, r9
    inc  r2
    mov  r2, @pShoot
    movb r9, r0
    jne  L128
    mov  r1, @pShoot
    jmp  L124
L128
    mov  @pShoot, r1
    bl   @processSID
    mov  r1, @pShoot
    ai   r9, >FF00
    jne  L128           <---- just checking, same bug on this side of the code
L124
    mov  *r10, r1       restore bank
    mov  r1, @nBank
    mov  r1, *r1
    inct r10            restore stack
    mov  *r10+, r11
    mov  *r10+, r9
    b    *r11           back to caller
    .size    doAllMusic, .-doAllMusic
    even

So why didn't it happen without -fno-function-cse?

    def    doAllMusic
doAllMusic
    ai   r10, >FFF8
    mov  r9, @>6(r10)
    mov  r11, @>4(r10)
    mov  @nBank, *r10
    mov  @joynum, r0
    jeq  L120
    li   r1, >8400
    bl   @playSNMusic
L120
    li   r2, >6010
    mov  r2, @nBank
    mov  r2, @>6010
    movb @>8379, r1
    srl  r1, 8
    andi r1, >1
    jeq  L121
    mov  @pSfx, r1
    jeq  L122
    movb *r1, r9        we still get a byte without clearing
    inc  r1
    mov  r1, @pSfx
    movb r9, r0         we still check it as a byte above the loop
    jne  L123
    clr  r1
    mov  r1, @pSfx
    jmp  L124
L123
    li   r2, processSID     function in register
L127
    mov  @pSfx, r1
    mov  r2, @>2(r10)
    bl   *r2                save to stack and call it
    mov  r1, @pSfx          put return back in pSfx
    ai   r9, >FF00          same AI >FF00 to decrement
    mov  @>2(r10), r2       but then we restore R2 from the stack
    movb r9, r0             and use an explicit test for byte zero
    jne  L127
    jmp  L124
L122
    clr  r2
    movb r2, @blockSfx
    jmp  L124
L121
    mov  @pShoot, r2
    jeq  L124
    movb *r2, r9
    inc  r2
    mov  r2, @pShoot
    movb r9, r0
    jne  L125
    mov  r1, @pShoot
    jmp  L124
L125
    li   r2, processSID
L128
    mov  @pShoot, r1
    mov  r2, @>2(r10)
    bl   *r2
    mov  r1, @pShoot
    ai   r9, >FF00   ; addqi3
    mov  @>2(r10), r2
    movb r9, r0
    jne  L128
L124
    mov  *r10, r1
    mov  r1, @nBank
    mov  r1, *r1
    ai   r10, >4   ; add hi3
    mov  *r10+, r11
    mov  *r10+, r9
    b    *r11
    .size    doAllMusic, .-doAllMusic
    even

So it looks like it might be fluke. I suspect what's happened is that we don't differentiate whether a condition code of Equal was for a byte or a word compare. GCC knows the AI set condition codes, but it can't know that the code was set for the entire WORD rather than just the BYTE we care about.

As much as it might feel wasteful, it's pretty common even in hand-rolled assembly to clear a register before moving bytes to it. Maybe we should consider that here?

The alternative is probably to stop using AI for byte math (instead using AB/SB), that way the condition codes will always be correct.

Ooooor... maybe with AI on byte values we could tell the compiler that it invalidates the condition codes, so that a test is generated if it's needed?

Edited January 27 by Tursi

Tursi · January 27

(I tried the AB approach since we have R0 right there... it didn't cost too much so the net benefit was large, and I didn't hit any other issues.)

JasonACT · January 27

MCVE:

#define NULL 0
unsigned char * processSID (unsigned char *);
unsigned char * pSfx;

// Compile with: -Os or -O2 with -fno-function-cse

void doAllMusic() {
    unsigned char regs = *(pSfx++);
    if (0 == regs) {
        pSfx = NULL;
    } else {
        while (regs--) {
            pSfx = processSID(pSfx);
        }
    }
}

Nice find.

Tursi · January 27

7 hours ago, JasonACT said:

MCVE:

#define NULL 0
unsigned char * processSID (unsigned char *);
unsigned char * pSfx;

// Compile with: -Os or -O2 with -fno-function-cse

void doAllMusic() {
    unsigned char regs = *(pSfx++);
    if (0 == regs) {
        pSfx = NULL;
    } else {
        while (regs--) {
            pSfx = processSID(pSfx);
        }
    }
}

Nice find.

Thanks for that!

Tursi · January 28

Unfortunately another case turned up - this one is a regression over the old compiler.

I have this function which checks whether two bytes in external memory are bitwise NOT to each other. The test fails because the code is sign extending and forgetting that it matters.

#define UBERGROM_RD *((volatile unsigned char*)0x983c)
#define UBERGROM_WD *((volatile unsigned char*)0x9C3c)
#define UBERGROM_CHECK 0xf800

int checkHighScores() {
    // just check the configuration bits. If bytes 0 and 1
    // are not inverted copies of each other, assume no ubergrom
    // config space is always mapped, so we can just go ahead and read
    GROM_SET_ADDRESS(UBERGROM_CHECK);
    unsigned char a = UBERGROM_RD;
    unsigned char b = UBERGROM_RD;
    if (a != (~b)) return 0;
    return 1;
}

    def    checkHighScores
checkHighScores
    li   r1, >F800      hard coded address in GROM
* Begin inline assembler code
* 41 "/mnt/d/work/libti99ALL/grom.h" 1
    movb r1,@>9c02      load address (inline asm)
    swpb r1
    movb r1,@>9c02
    swpb r1
* 0 "" 2
* End of inline assembler code
    movb @>983C, r3     read from GROM (base 15 for Ubergrom config)
    movb @>983C, r2     and again for the next byte
    clr  r1             zero return code
    srl  r3, 8          make byte to int (>00aa)
    srl  r2, 8          make byte to int (>00bb)
    inv  r2             invert second word (>ffBB)
    c    r3, r2         test equal <--- bug? - the MSB won't match because of the invert
    jne  L4             skip if different
    li   r1, >1         else change return code to 1
L4
    b    *r11           back to caller
    .size    checkHighScores, .-checkHighScores

If I make them signed chars, then the promotion sign extends the values and the inv does the right thing. But I should be able to invert and test unsigned chars as well? (-Os and -no-function-cse)

This worked on the old compiler, interestingly. I had this code in gromcfg (doing essentially the same thing):

unsigned char GetConfigByte() {
    unsigned char x,y;

    x = GromReadData(0xF800, 15);
    y = GromReadData(0xF801, 15);
    y = ~y;
    if (x != y) {
        x=0;
    }

    return x;
}

And this definitely worked.

    def    GetConfigByte
GetConfigByte
    ai   r10, >FFFA         stack setup
    mov  r10, r0
    mov  r11, *r0+
    mov  r9, *r0+
    mov  r13, *r0+
    li   r13, GromReadData  store function address
    li   r1, >F800          get base address
    li   r2, >F00           get grom base index as a byte
    bl   *r13               call GromReadData to read a byte
    movb r1, r9             store x in r9
    li   r1, >F801          next byte from GROM
    li   r2, >F00           same base
    bl   *r13               call GromReadData
    inv  r1                 invert result
    cb   r9, r1             byte compare  <---- since we stayed working with bytes, it was fine
    jeq  L15                if equal, skip ahead
    clr  r9                 else invalid, so zero the byte
L15
    movb r9, r1
    mov  *r10+, r11
    mov  *r10+, r9
    mov  *r10+, r13
    b    *r11
    even

Why did it promote the comparison?

Tursi · January 28

Something to do with having the invert in the comparison. If I change the code to look like the older code with an explicit invert, it does the right thing:

int checkHighScores() {
    // just check the configuration bits. If bytes 0 and 1
    // are not inverted copies of each other, assume no ubergrom
    // config space is always mapped, so we can just go ahead and read
    GROM_SET_ADDRESS(UBERGROM_CHECK);
    signed char a = UBERGROM_RD;
    signed char b = UBERGROM_RD;
    b = ~b;  // <--- do it explicitly
    if (a != b) return 0;
    return 1;
}

	def	checkHighScores
checkHighScores
	li   r1, >F800
* Begin inline assembler code
* 41 "/mnt/d/work/libti99ALL/grom.h" 1
	movb r1,@>9c02
	swpb r1
	movb r1,@>9c02
	swpb r1
* 0 "" 2
* End of inline assembler code
	movb @>983C, r3   get a
	movb @>983C, r2   get b
	clr  r1           clear return
	inv  r2           invert b
	cb   r3, r2       <--- now it is a byte compare, code works
	jne  L4
	li   r1, >1
L4
	b    *r11
	.size	checkHighScores, .-checkHighScores

JasonACT · January 29

14 hours ago, Tursi said:

Why did it promote the comparison?

...and @khanivore this might be a bug in the version 4 compiler...

volatile unsigned char a;
volatile unsigned char c;

int ttt() {
    if (a != (~c)) return 0;
    return 1;
};

Compiles to this with the 4.3.3 ARM version of GCC that I use for an old device I write software for sometimes:

	.file	"test.c"
	.text
	.align	2
	.global	ttt
	.type	ttt, %function
ttt:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	ldr	r3, .L3
	ldrb	r2, [r3, #0]	@ zero_extendqisi2
	ldr	r3, .L3+4
	ldrb	r0, [r3, #0]	@ zero_extendqisi2
	mvn	r0, r0
	cmp	r2, r0
	movne	r0, #0
	moveq	r0, #1
	bx	lr
.L4:
	.align	2
.L3:
	.word	a
	.word	c
	.size	ttt, .-ttt
	.comm	a,1,1
	.comm	c,1,1
	.ident	"GCC: (GNU) 4.3.3"

If I'm not mistaken, it's doing the exact same thing. Those are not my comments - that's the original GCC ARM output.

Tursi · January 29

Interesting find! I went over to Godbolt.org to try and understand the code... and the first thing I noticed was the default compiler (x86-64 gcc trunk) returns a warning:

"warning: comparison of promoted bitwise complement of an unsigned value with unsigned [-Wsign-compare]"

That implies that this is known and not intended to function correctly? I have to admit that seems counter-intuitive to me, but if it's meant to be, perhaps it's not something we need to fix?

I've not often had to write code that thinks about the promotion, but of course if I cast the result of the invert, that also works.

if (a != (unsigned char)(~c)) return 0;

So it's clearly not a regression, the compiler literally thinks that the two forms are very different in intent.

((Edit: sorry if you read this live, I wrote it while testing and ended up editting a few times

))

Edited January 29 by Tursi

JasonACT · January 29

8 minutes ago, Tursi said:

perhaps it's not something we need to fix?

I'll be a hard one to fix, that's for sure. Probably what someone else thought when they added the warning

Tursi · January 29

5 minutes ago, JasonACT said:

I'll be a hard one to fix, that's for sure. Probably what someone else thought when they added the warning

Yeah, I think so. I can find people arguing about it all the back to 2008!

https://gcc.gnu.org/legacy-ml/gcc-bugs/2008-12/msg00046.html

JasonACT · January 29

4 hours ago, Tursi said:

Yeah, I think so. I can find people arguing about it all the back to 2008!

https://gcc.gnu.org/legacy-ml/gcc-bugs/2008-12/msg00046.html

I'm adding -Wall and -W to my compiles from now on (-W is apparently the same [but an old version of/] as -Wextra).

GCC for the TI

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation

Recently Browsing 0 members