insomnia

Posts posted by insomnia

  1. Not to hype my own project (much), but my port of GAS has most of the parsing code required for an assembler written in C. It would take more work to split that stuff off into a complete assembler, but I imagine it wouldn't be too bad.

     

    If I have time after working on my other stuff, making an assembler library for the other IDE-like projects out there might be an interesting project. You can never have too many tools, right?

  2. Tursi:

     

    The +4 bug can be found at gcc/config/tms9900/tms9900.md line 1023

     

    This is in the addhi3 recipe, and the error gets invoked if a 16-bit ADD operation is called for when one of the addends has a constant integer value of four.

     

    (define_insn "addhi3"
     [(set (match_operand:HI 0 "general_operand" "=rR>,Qi,r,rR>,Qi")
    (plus:HI (match_operand:HI 1 "general_operand" "%0,0,0,0,0")
    	 (match_operand:HI 2 "general_operand" "rR>LMNOP,rR>LNMOP,i,Q,Q")))]
     ""
     "*
    {
     if (GET_CODE (operands[2]) == CONST_INT)
       {
         if (INTVAL(operands[2]) == 1)
           return \"inc %0\";
         else if (INTVAL(operands[2]) == -1)
           return \"dec %0\";
         else if (INTVAL(operands[2]) == 2)
           return \"inct %0\";
         else if (INTVAL(operands[2]) == -2)
           return \"dect %0\";
         else if (INTVAL(operands[2]) == 4)
           return \"c *%0+ *%0+\";
         else
           return \"ai %0, %2\";
       }
    
     return \"a %2, %0\";
    }"
     [(set_attr "length" "1,2,2,2,3")])
    
    

     

    The existing line:

    return \"c *%0+ *%0+\";

    Should be:

    return \"c *%0+, *%0+\";
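
    The constant cases of the addhi3 recipe can be sketched in plain C to make the fix easier to see. This is only an illustration: addhi3_template and uses_compare_trick are hypothetical names, not part of GCC, and the sketch ignores the operand constraints that the real pattern checks.

```c
#include <string.h>

/* Hypothetical helper mirroring the constant cases of the addhi3 recipe.
   Returns the output template used when adding `value` to operand 0. */
const char *addhi3_template(int value)
{
    switch (value) {
    case 1:  return "inc %0";
    case -1: return "dec %0";
    case 2:  return "inct %0";
    case -2: return "dect %0";
    case 4:  return "c *%0+, *%0+";  /* corrected: comma between operands */
    default: return "ai %0, %2";
    }
}

/* True when the "+=4" compare trick would be emitted for this constant. */
int uses_compare_trick(int value)
{
    return strcmp(addhi3_template(value), "c *%0+, *%0+") == 0;
}
```

    Each post-increment in the "c" template adds two to the register, which is where the total of four comes from.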

     

    I confirmed your results with the wrong load address for the return pointer, but I've been busy with work and don't have a fix yet.

     

    The prologue and epilogue code is located in gcc/config/tms9900/tms9900.c in functions tms9900_expand_prologue and tms9900_expand_epilogue. I'm looking into these now.

     

    From the reaction here, I really should put the register lifetime changes on hold for now and crank out a new patch and procedures. Stay tuned.

     

    Matthew:

    I agree that professional programmers need to be aware of the limitations of any floating point implementation (even decimal floats), but remember that most people here are probably not professional programmers. Floating-point numbers are sometimes just the most direct tool available to solve a problem.

     

    Honestly, I've only had a few projects where floats were appropriate. Usually fixed-point notation was the better choice due to precision, speed or size concerns. But using fixed-point numbers requires even more knowledge of the numbers which will be used. It's an easy way to get overflows, quantisation errors or other nonsensical results as the data's required precision butts up against the limits of the representation.
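
    To make the overflow and quantisation points concrete, here is a minimal sketch of a Q8.8 fixed-point format (8 integer bits, 8 fraction bits). The format choice and all of the names are assumptions for illustration, not something taken from the port.

```c
#include <stdint.h>

/* Q8.8: signed 16-bit word, 8 integer bits and 8 fraction bits. */
typedef int16_t q8_8;

#define Q8_8_ONE 256

/* Conversion truncates toward zero, which is the quantisation error. */
q8_8 q8_8_from_double(double x)
{
    return (q8_8)(x * Q8_8_ONE);
}

/* Multiply in 32 bits, then rescale. The caller should check
   q8_8_mul_overflows first: an out-of-range result does not fit. */
q8_8 q8_8_mul(q8_8 a, q8_8 b)
{
    int32_t wide = (int32_t)a * (int32_t)b;
    return (q8_8)(wide / Q8_8_ONE);
}

/* Nonzero when the rescaled product falls outside the 16-bit range. */
int q8_8_mul_overflows(q8_8 a, q8_8 b)
{
    int32_t scaled = ((int32_t)a * (int32_t)b) / Q8_8_ONE;
    return scaled > INT16_MAX || scaled < INT16_MIN;
}
```

    In this format 0.1 is stored as 25/256, which is about 0.0977, and 100.0 times 2.0 no longer fits in the word.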

     

    I just started looking at floats because I thought TI's implementation ate a lot of memory, and I thought I might be able to do better now.

  3. First off, thanks for your interest in this. Even though I'm trying to be thorough, mistakes happen, and it's great to get a second look at things.

     

    I've seen the missing comma bug you found. It's the result of a typo in the tms9900.md file for a peephole optimization. That fix will be in the next patch. Here are some other things that will make that patch:

     

    Divide and modulus operations now merged when possible

     

    Fix data symbol declarations, now TI compliant (The current code worked fine with the modified binutils assembler, but was invalid for any other assembler)

     

    Fix "+=4" form, was missing comma in emitted code (Your first bug)

     

    Fix alignment of code, in some cases it was possible to misalign code by using odd-length string constants

     

    Your second problem (different save and restore addresses) really worries me. My GCC environment is currently in shambles and in no condition to confirm or fix what you found.

     

    I'm also going to have to stall on the build procedures. I've tried to make sure I didn't skip any steps, but apparently I did. Once I get my environment back together I can get a better procedure out.

     

    In my absence, I've been going in a couple different directions:

     

    Working on a libc library for the TI

     

    Playing around with the idea of reimplementing BASIC calls in C form, to ease the transition from BASIC to C for BASIC-only programmers.

     

    Research and framework for an IEEE-compliant floating point library

     

    Research on 16-bit floating point representation (google for "half" floating point format)

     

    And the big time sink: restructure the register lifetime calculation in GCC to allow for the faster INV and ABS comparisons if the test value can be safely destroyed. This is turning out to be a lot more work than I expected, and is the reason my GCC environment is messed up. It's currently full of test and debug code which breaks when run against general-purpose code.

     

    I guess the short version of all this is "I really appreciate your working with my code, and hang on a bit, help is on the way"

  4. How about computed jumps?

     

    I was thinking of ways to take advantage of the X instruction, and this is what I came up with

     

    In basic:

    on index goto 100, 110, 120, 130
    

     

    in assembly:

     

    * Assume "index" is stored in R0
    * Validate input value
     ci r0, 3
     jh badval     * Index is negative or greater than 3, bad value
    
    * Jump to correct line
     ai r0, >1000  * This is the code for "JMP 0"
     x  r0         * Jump into the table below
     jmp line100
     jmp line110
     jmp line120
     jmp line130
    

     

    Yeah, this is pretty much what X is for, but it's an easy instruction to overlook.
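
    For comparison, the same dispatch can be sketched in C with a table of function pointers. The handler names are hypothetical stand-ins for the BASIC lines, and the index is zero-based as in the assembly version, so the four-entry table accepts 0 to 3.

```c
/* Hypothetical handlers standing in for BASIC lines 100 to 130. */
int line100(void) { return 100; }
int line110(void) { return 110; }
int line120(void) { return 120; }
int line130(void) { return 130; }

/* Returns the handler's result, or -1 for an out-of-range index.
   The unsigned parameter makes negative values fail the range check,
   much as the unsigned JH comparison does. */
int on_index_goto(unsigned index)
{
    static int (*const table[])(void) = {
        line100, line110, line120, line130
    };

    if (index >= sizeof table / sizeof table[0])
        return -1;              /* same role as "jh badval" */
    return table[index]();      /* same role as "x r0" into the jump table */
}
```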

  5. I often use ABS, CLR, INV, SETO for simple on/off flags. Sometimes STWP for a quick 'set' if I don't care about the value in the register. All except STWP can be used on word values equally well, though a little slower.

     

          SETO R6      * set flag 
          ABS  R6       * test flag
          JEQ  SETFLG   * jump if EQ is set
          CLR  R6       * clear flag ... or we could use INV R6 assuming we used SETO/CLR
    
    * Flip the flag:
          INV R6
          ABS  R6   * is it on or off?  
    
    * sometimes I'll "set" the flag to nonzero like this - 8 cycles IIRC
          STWP R6   * always !=0, but INV can't be used to turn flag 'off'
    
    

     

    He he, I never thought about using ABS. I would have used MOV R6,R6 in your example above. Sets the EQ bit if 0 :-)

    I totally agree. In fact I'm shamelessly stealing the ABS test for a GCC optimization step. Every cycle counts right?

  6. I don't know if this is a trick or not, but I thought this was kind of neat.

     

    By using the BLWP instruction, you can have overlapping workspaces between the caller and callee. This allows parameters to be passed to the callee, but preserve some register values across the function call. This would be sort of like a "caller save" calling convention, where the caller saves all important info before calling a function.

     

    Example: Save four registers, then call a function which takes three arguments and returns a value
    
    Caller regs:              0 1 2 3 4 5 6 7 8 9 A B C D E F
    Callee regs:  1 2 3 4 5 6 7 8 9 A B C D E F '--.--' '-.-'
                                   | | | '-.-'    |      |
    Arg3        --------------------' | |   |      |      |
    Arg2        ----------------------' |   |      |      |
    Arg1/Return ------------------------'   |      |      |
    Callee context    ----------------------'      |      |
    Caller saved regs -----------------------------'      |
    Caller context    ------------------------------------'
    

     

    In this example, the caller's R9 to R12 are saved across the call, and the caller places arguments in R3, R4 and R5. The return value will be in the caller's R5 after the return.

     

    The caller registers R0 to R8 are destroyed by the call, but the callee has nearly all registers available for its use.
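
    The overlap is easier to follow as arithmetic. In the example the callee workspace starts (4+3)*2 = 14 bytes below the caller workspace, so a caller register Rn is seen by the callee as R(n+7). A minimal sketch, with hypothetical names:

```c
/* Shift between the workspaces in registers: 4 saved registers plus
   3 context registers, as in the example above. */
#define WORKSPACE_SHIFT 7

/* Returns the callee's register number for a caller register, or -1
   when the caller register lies outside the callee's workspace. */
int callee_reg(int caller_reg)
{
    int r = caller_reg + WORKSPACE_SHIFT;
    return (r >= 0 && r <= 15) ? r : -1;
}
```

    With this mapping the caller's R3 to R5 are the callee's R10 to R12, and the caller's R6 to R8 become the callee's R13 to R15, which BLWP overwrites with the return context.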

     

    * Argument setup
     li   r3, 1
     li   r4, 2
     li   r5, 3
    
    * Call setup:
     stwp r6                 * 8     * Get current workspace
     ai   r6, -(4+3)*2       * 14+4  * Locate callee workspace 14 bytes below this one
     li   r7, FUNC           * 12+4  * Set jump address
     blwp r6                 * 26    * Jump to called function
                             * ----
                             * 68
    
    * Return value in R5
    

     

    Even though BLWP is slower than BL, this is pretty quick. The call convention I came up with for my GCC port takes about 100 cycles for a call setup, return and stack maintenance. Surprisingly, this is faster.

     

    I don't know of any other architecture where you can do something like this, so no compiler would support this kind of call. Assembly only for this guy. There are some other obvious drawbacks:

     

    The author would be restricted to using certain registers for certain purposes. The register restrictions could change from call to call, making tricky assembly code REALLY confusing.

     

    The small amount of scratchpad memory restricts the "stack" usage, and call tree depth. Wrapping around the top of scratchpad memory would result in hard-to-find memory errors.

     

    I'm not sure if "blwp r6" is valid or not to be honest. If not, that's OK, it just means the call setup needs a few additional instructions.

     

    Honestly, I don't think this is a reasonable call method for a general case, but it is awfully cool.

  7. Well, it's time to release a new set of patches. This time I've made sure they are usable (I'm still upset about that)

     

    Anyway, here are the bugs that got fixed for the GCC port:

     

    1) Convert decimal constants to hex

    This was a request a while back, and helps make the emitted assembly a bit easier to read.

     

    2) Add "C *Rx+, *Rx+" optimization for 16-bit "+=4" operations

    Yes, I shamelessly stole this from nouspikel.com

     

    3) Fix non-volatile register allocation

    There was a bug in the allocator, which prevented non-volatile registers from ever being used. Instead, the stack was always used to store non-volatile values.

     

    4) Optimize call pro/epilogue, make stack usage consistent

    5) Saved registers at bottom of stack frame

    There were some notes about this in earlier posts. I also noticed that in some cases, the saved registers would be pushed onto the stack first. In other cases, they would be pushed last. This didn't affect the functionality of the code, but it did make things confusing.

     

    6) R11 no longer falsely marked as non-volatile register

    R11 was considered a non-volatile register, but it makes more sense to consider it volatile. Think about it this way: If a called function itself calls a function, R11 (the return pointer) will point to the address after the BL instruction. In order to treat R11 as a true non-volatile register, it would have to be saved and restored around all function calls on the caller side. This seems dumb. R11 is now considered volatile, but is saved in the function prologue, and restored before the function exit.

     

    7) The last argument for functions like f(long a, long b, int c, long d) was lost

    The register allocator did not properly treat 32-bit quantities (stored in two consecutive registers), so the last argument, which would require more registers than are available, was lost. The allocator is smarter now.

     

     

     

     

    Building from source

    ====================

     

    Obtaining the base package

    --------------------------

    The GCC port to the TMS9900 processor is based on GCC 4.4.0. It can be obtained from gcc.gnu.org. This version of GCC was released in April of 2009, and many versions have since been released. There is a good possibility that the TMS9900 patch will work with a newer release, but this has not been tested. Proceed at your own risk.

     

     

    Patching for the TMS9900

    ------------------------

    Once the source archive has been downloaded, extract the files to the directory of your choosing. It is recommended that this be a different place than the directory in which you intend the resulting binaries to be located. This will help eliminate confusion, and make upgrading to later versions simpler.

     

    Once the archive has been expanded, run this command from the top-level directory of gcc:

    patch -p1 < gcc-4.4.0-tms9900-1.1.patch

     

     

    Building GCC

    ------------

    GCC will be configured to build only the C compiler. The other GCC languages are available, and may work, but have not been tested. Object-oriented languages like C++ or Java are not likely to work, since their constructors will be placed in a .init section which will require additional processing by the linker.

    Execute these commands to build and install the GCC tools. Be sure to replace INSTALLDIR with your intended installation path.

     

    ./configure --prefix INSTALLDIR --target=tms9900 --enable-languages=c

    make all-gcc

    make install

     

     

    Patching Binutils

    -----------------

    Binutils is patched using the same steps as GCC.

     

     

    Building Binutils

    -----------------

    The build process for Binutils is more involved than for GCC.

    Execute these commands to build and install the Binutils tools. Be sure to replace INSTALLDIR with your intended installation path.

     

    $ ./configure --prefix INSTALLDIR --target tms9900

    $ cd bfd

    $ make all

    $ cd ..

    $ make all

    $ make install

    binutils-2.19.1-tms9900-1.0-patch.tar.gz

    gcc-4.4.0-tms9900-1.1-patch.tar.gz

  8. A couple of points here (bear with me, I haven't figured out how to quote properly on this forum yet...):

     

    GCC optimization levels

    -----------------------

    The optimizer doesn't really know how to generate fast code. It can optimize for size, reduce algorithmic complexity, combine constants, move constant expressions out of loops, and inline functions. You get the idea. These things will typically generate fast code, but this is not guaranteed. Also, GCC has no way of counting cycles or taking advantage of the speed of the scratchpad memory.

     

     

    Scratch pad memory usage

    ------------------------

    The compiler uses a single register set. It doesn't know where in memory this set lies. Placing it is the responsibility of the C environment initialization code, which is usually a bit of assembly code. On other machines, this can be quite involved, potentially involving lots of driver or memory initialization. I'm anticipating that for the TI, I'll only need to initialize R10 (used for the stack pointer), set the workspace register to some convenient location and clear the BSS section. If performance demands that the register set be at a specific location, that can be handled in the initialization code. If scratchpad is to be used for memory, that can be done in C by specifying data pointers in that space. If code must be run from there, that can be done in C as well (use function pointers to scratchpad). The point is that these things are beyond the scope of converting C code to assembly.

     

     

    Calling convention

    ------------------

    In my port, function arguments are primarily passed by register. If there is not enough room for all the arguments, the rest are passed on the stack. I'm still finalizing the order of these arguments on the stack. Likewise, function-local variables are stored in registers if there is room. GCC can determine the lifetime of these variables, and can use the same register to hold many different variables, as long as they are not needed at the same time. Once all registers are used, the remaining variables are stored on the stack.

     

    The called function is responsible for preserving a small set of registers if they are to be used. The calling function assumes that these registers will not be modified by the act of calling a function. Values in the other registers are assumed to be destroyed by calling a function.

     

    The short version: R1 through R6 are used to pass function arguments, R1 is used to return values to the caller. R9 to R11, R13 to R15 must not be changed by calling a function. All other registers can be modified in any way.

     

    If assembly modules hold to the calling convention, things will run smoothly. If not, expect random behavior and crashes.

     

     

    "Taboo" registers

    -----------------

    This is dependent on the calling convention. For my port, any register may be used, but the values in six registers must not be changed when an assembly module returns: R9, R10 (stack pointer), R11 (return pointer), and R13, R14 and R15 (these are needed to return from a BLWP instruction). This can be done by saving these values in memory and restoring them to the registers before the module exits, or most simply by not using these registers.
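
    As a quick reference for assembly authors, the register roles described in this post can be written down as two small predicates. This is only a sketch with hypothetical names, and it reflects the convention as described here, which may differ from later patches.

```c
/* Nonzero for registers whose values must be unchanged on return:
   R9, R10 (stack pointer), R11 (return pointer), R13, R14, R15. */
int must_preserve(int reg)
{
    switch (reg) {
    case 9: case 10: case 11:
    case 13: case 14: case 15:
        return 1;
    default:
        return 0;
    }
}

/* Nonzero for registers used to pass function arguments (R1 to R6). */
int is_argument_reg(int reg)
{
    return reg >= 1 && reg <= 6;
}
```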

     

     

    Accessing values on the stack

    -----------------------------

    The stack grows towards zero, so all stack variables can be accessed by indexing off the stack pointer register. This also allows more flexibility when calling functions, since fixed addresses are not used.

     

     

    The general idea is that by using the C language, functional code can be written quickly, and code can be ported more easily from other machines. If more control is needed in how code is implemented (either for size, speed, memory usage or other requirements), assembly can be used.

    I think that's it for the points raised so far. I intend to make a document describing all this in more detail. See the earlier posts for more information.

  9. The big advantage that the 68000 has over the TMS9900 is its load and store multiple instructions. The pre-decrement modes are handy, but clever coding can mostly compensate for that. Dedicated push and pop instructions would add no real advantage beyond a pre-decrement mode. There are MOV modes which can do the push and pop in mostly one instruction.

     

    I've been thinking about stack-y things lately, given the comments here, and decided to second guess my earlier decisions. As a result, I came up with nine different stack call conventions. Most were pretty poor or offered no real advantage over what I already had. On the upside, I was able to squeeze out two bytes and 12 cycles from my earlier design. The four most interesting ones are listed below.

     

    One thing I've found out is that the TMS9900 really makes you decide if you want to optimize for speed or size. Apparently, you can't have both. I recall that one of the criticisms of the TI firmware was that it was heavily optimized for size, making it a bit slow. This is a bit annoying since with RISC architectures (the TMS9900 mostly qualifies), fast and small are usually the same thing. I blame the slow-as-dirt 8-bit memory.

     

    At any rate, I wouldn't be surprised if the TMS9900 doesn't turn out to be some kind of computing powerhouse.

     

    So here's what I found stack-wise: (note that the form numbers here don't correspond to the ones I mentioned earlier)

     

    This is the simplest stack setup and teardown code you can have. All this does is save R11 (the return pointer) at the start of a function, and restore it at the end.

     

    Form 0 (the simplest R11 save possible)
    
    code                 bytes  cycles
    ------------         -----  ------
    dect sp              2      10+8     = 18
    mov r11, *sp         2      14+8+4+8 = 34
    ...
    (function goes here)
    ...
    mov *sp+, r11        2      14+8+8+8 = 38
    
    Totals              =6               = 90
    
    

     

    These forms assume local variables in a stack frame, and non-volatile register values stored on the stack

     

    Form 1
    code                        bytes  cycles
    ------------                -----  ------
    ai sp, -regsize-framesize   4      14+8+8   = 30
    mov r11, *sp                2      14+8+4+8 = 34
    mov r9,  @2(sp)             4      14+8+8+8 = 38
    mov r13, @4(sp)             4      14+8+8+8 = 38
    mov r14, @6(sp)             4      14+8+8+8 = 38
    mov r15, @8(sp)             4      14+8+8+8 = 38
    ...
    (function goes here)
    ...
    mov *sp+, r11               2      14+8+8+8 = 38
    mov *sp+, r9                2      14+8+8+8 = 38
    mov *sp+, r13               2      14+8+8+8 = 38
    mov *sp+, r14               2      14+8+8+8 = 38
    mov *sp+, r15               2      14+8+8+8 = 38
    ai sp, framesize            4      14+8+8   = 30
    
    Totals                    =36               =436
    
    
    

     

    Form 2
    code                        bytes  cycles
    ------------                -----  ------
    ai sp, -regsize-framesize   4      14+8+8   = 30
    mov sp, r0                  2      14+8     = 22
    mov r11, *r0+               2      14+8+8+8 = 38
    mov r9 , *r0+               2      14+8+8+8 = 38
    mov r13, *r0+               2      14+8+8+8 = 38
    mov r14, *r0+               2      14+8+8+8 = 38
    mov r15, *r0                2      14+8+4+8 = 34
    ...
    (function goes here)
    ...
    mov *sp+, r11               2      14+8+8+8 = 38
    mov *sp+, r9                2      14+8+8+8 = 38
    mov *sp+, r13               2      14+8+8+8 = 38
    mov *sp+, r14               2      14+8+8+8 = 38
    mov *sp+, r15               2      14+8+8+8 = 38
    ai sp, framesize            4      14+8+8   = 30
    
    
    Totals                    =30               =458
    
    
    

     

    Form 2a (requires a stack frame)
    code                        bytes  cycles
    ------------                -----  ------
    ai sp, -regsize-framesize   4      14+8+8   = 30
    mov sp, r0                  2      14+8     = 22
    mov r11, *r0+               2      14+8+8+8 = 38
    mov r9 , *r0+               2      14+8+8+8 = 38
    mov r13, *r0+               2      14+8+8+8 = 38
    mov r14, *r0+               2      14+8+8+8 = 38
    mov r15, *r0                2      14+8+4+8 = 34
    ...
    (function goes here)
    ...
    mov *sp+, r11               2      14+8+8+8 = 38
    mov *sp+, r9                2      14+8+8+8 = 38
    mov *sp+, r13               2      14+8+8+8 = 38
    mov *sp+, r14               2      14+8+8+8 = 38
    mov *sp,  r15               2      14+8+4+8 = 34  <-- this line is different, no post-increment, saved 4 cycles
    ai sp, framesize+2          4      14+8+8   = 30
    
    Totals                    =30               =454
    
    
    

     

    Nvols  used:       0   1   2   3   4
                     --  --  --  --  --
    Form0  bytes:      6
          clocks:    90
    
    Form1  bytes:     12  18  24  30  36   Use for one or fewer non-vol regs saved
          clocks:   132 208 284 360 436
                    ^^^^^^^
    Form2  bytes:     14  18  22  26  30   Use for two or more non-vol regs saved
          clocks:   154 230 306 382 458
                            ^^^^^^^^^^^
    Form2a bytes:     14  18  22  26  30   Use for two or more non-vol regs saved plus stack
          clocks:   150 226 302 378 454
    
    If no stack frame is required, four bytes and 30 cycles can be saved from forms one and two
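
    The rows of the table follow simple linear formulas in the number of saved non-volatile registers. A minimal sketch that reproduces them, with hypothetical names and constants read off the table above:

```c
struct frame_cost {
    int bytes;
    int clocks;
};

/* Form 1: each extra register costs 6 bytes and 76 clocks. */
struct frame_cost form1_cost(int nvols)
{
    struct frame_cost c = { 12 + 6 * nvols, 132 + 76 * nvols };
    return c;
}

/* Form 2: each extra register costs 4 bytes and 76 clocks. */
struct frame_cost form2_cost(int nvols)
{
    struct frame_cost c = { 14 + 4 * nvols, 154 + 76 * nvols };
    return c;
}

/* Form 2a: same size as form 2, four clocks faster. */
struct frame_cost form2a_cost(int nvols)
{
    struct frame_cost c = { 14 + 4 * nvols, 150 + 76 * nvols };
    return c;
}

/* Picks the smaller form by byte count. A tie goes to form 1,
   which is the faster of the two. Returns 1 or 2. */
int smaller_form(int nvols)
{
    return form1_cost(nvols).bytes <= form2_cost(nvols).bytes ? 1 : 2;
}
```

    This gives form 1 for zero or one saved register and form 2 from two registers upward, matching the notes in the table.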
    

     

    Exciting! Who isn't thrilled by stack frames?

     

    This is one form that looked interesting, but is totally impractical for a compiler, although it might be handy in some assembly library or something.

     

    The idea is that most of the stack setup and teardown is in a callable function. This reduces the per-function byte count at the expense of speed. Lots of speed. I'm mostly worried about the "b @8(r11)" instruction, I'm not sure if that's legal or will do what I expect. I haven't tested this, and it may not work, but it sure looks interesting.

     

    Form 3
    code                                  bytes  clocks
    ------------                          -----  ------
    ai sp, -(regs*2+framesize)            4      14+8+8   = 30   <-- allocate space on the stack
    mov r11, r0                           2      14+8     = 22   <-- make copy of return pointer
    bl @(store_multiple+16-regs*4)        4      12+8+8   = 28   <-- jump into common stack setup
    ...
    (function goes here)
    ...
    li r0, framesize+regs*2               4      12+8+8   = 28   <-- size of stack allocated
    b @(load_multiple_return+16-regs*4)   4      8+8+8    = 24
      <-- jump into common stack teardown then return
    
    store_multiple:
      mov r15, @-8(sp)         4      14+8+8+8 = 38  <-- entry point for four registers saved
      mov r14, @-6(sp)         4      14+8+8+8 = 38  <-- entry point for three registers saved
      mov r13, @-4(sp)         4      14+8+8+8 = 38  <-- entry point for two registers saved
      mov r9,  @-2(sp)         4      14+8+8+8 = 38  <-- entry point for one register saved
      mov r0, *sp              2      14+8+4+8 = 34  <-- entry point for zero registers saved
      b @8(r11)                2      8+4      = 12  <-- return to just after "bl @..." call
    
    load_multiple_return:
      mov @-8(sp), r15         4      14+8+8+8 = 38  <-- entry point for four registers saved
      mov @-6(sp), r14         4      14+8+8+8 = 38  <-- entry point for three registers saved
      mov @-4(sp), r13         4      14+8+8+8 = 38  <-- entry point for two registers saved
      mov @-2(sp), r9          4      14+8+8+8 = 38  <-- entry point for one register saved
      mov *sp, r11             2      14+8+4+8 = 34  <-- restore link pointer
      a sp, r0                 2      14+8     = 22  <-- restore stack
      b *r11                                         <-- return to caller
    
          nvols:      0   1   2   3   4
    Form3  bytes:     18  18  18  18  18
          clocks:   234 310 386 462 538
    
    

     

    The TMS9900 has a simple instruction set, but there are still some neat things that can be done with it. So now, I'm off to implement what I've shown here, and fix more compiler bugs.

  10. Example Code

    ============

     

    /* This is some undefined function. Declared extern to prevent inlining */
    extern int test2(int *b2);
    
    /* Test function called by main, uses arguments on registers and the stack */
    int test(long a1, long a2, long a3, long a4)
    {
     int array[10];  /* Local variable stored on the stack */
     test2(array);
     return(7);
    }
    
    main()
    {
     test(1,2,3,4);
    }
    

     

    Output, as compiled with -O1 optimisation, comments added by hand

     

    pseg
    
    def	test
    test:
           ****************************
           * Function prologue
    ai r10, -22         * Allocate space for saved regs and local vars
                               *    sizeof(array) + sizeof(R11) = 22
    mov r11, @20(r10)   * Save the link register (R11) to the stack
    
           ****************************
           *   test2(array);
    mov r10, r1         * Argument 1 in R1: &array[0]
    bl @test2           * Call test2
    
           ****************************
           * return(7)
    li r1, 7            * Return value on R1: >0007
    
           ****************************
           * Function epilogue
    ai r10, 20          * Free local variables from the stack 
    mov *r10+, r11      * Restore R11
    b *r11              * Return to caller
    
    def	main
    main:
           ****************************
           * Function prologue
    ai r10, -6          * Allocate space for saved regs and arguments
                               *   sizeof(a4) + sizeof(R11) = 6
    mov r11, @4(r10)    * Save the link register (R11) to the stack
    
           ****************************
           *   test(1,2,3,4);
    clr *r10            * Argument 4 on stack: >0000 >0004 = 4
    li r1, 4
    mov r1, @2(r10)
    
    clr r1              * Argument 1 in R1, R2: >0000 >0001 = 1
    li r2, 1
    
    mov r1, r3          * Argument 2 in R3, R4: >0000 >0002 = 2
    li r4, 2
    
    mov r1, r5          * Argument 3 in R5, R6: >0000 >0003 = 3
    li r6, 3
    bl @test            * Call test
    
           ****************************
           * Function epilogue
    ai r10, 4           * Free argument space from the stack 
    mov *r10+, r11      * Restore R11
    b *r11              * Return to caller
    
           ref	test2
    

     

    Stack usage for this example. Assume the stack pointer is assigned to >2000 in the initialization code.

     

     Usage of the stack memory:
    
     >2000
     ~~~~ position at start of main() ~~~~
     >1FFE  --- R11 (link pointer) from main
     >1FFC  -.- Argument 4 for test()
     >1FFA  -'
    
     ~~~~ position at start of test() ~~~~
     >1FF8  --- R11 (link pointer) from test
     >1FF6  -.- array from test
     >1FF4   |
     >1FF2   |
     >1FF0   |
     >1FEE   |
     >1FEC   |
     >1FEA   |
     >1FE8   |
     >1FE6   |
     >1FE4  -'  <-- array[0]
     ~~~~ position at start of test2() ~~~~
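
    The addresses can be re-derived from the two prologues in the listing ("ai r10, -6" in main and "ai r10, -22" in test). A minimal sketch with hypothetical helper names, assuming 16-bit address arithmetic:

```c
#include <stdint.h>

/* Stack pointer after a prologue of the form "ai r10, -frame_bytes". */
uint16_t sp_after_prologue(uint16_t sp, uint16_t frame_bytes)
{
    return (uint16_t)(sp - frame_bytes);
}

/* Address of a slot accessed as "@offset(r10)". */
uint16_t slot_address(uint16_t sp, uint16_t offset)
{
    return (uint16_t)(sp + offset);
}
```

    Starting from >2000, main leaves the stack pointer at >1FFA with its R11 at >1FFE. test then moves it to >1FE4, which is where array[0] sits, with its R11 at >1FF8.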
    

  11. The Stack

    =========

    A stack is required of any C language implementation. It is used to store local values, pass arguments, and record the position in a call tree. The current stack position is stored in R10, and grows towards zero as more stack is used. This position must be initialised before C code may be executed.

     

     

    Push And Pop

    ------------

    The TMS9900 does not have push or pop operations, but it does have post-increment MOV forms which work well for stack pop operations.

     

    Push operations take one of two forms, depending on the amount of data to push (Remember that R10 is used for the stack pointer).

     

    Form 1:
     Instructions            Bytes  Cycles  
     ---------------         -----  ------    
     ai r10, -regcount*2      4      14+0
     mov r0, *r10+            2      14+8
     mov r1, *r10+            2      14+8
     ...
     ai r10, -regcount*2      4      14+0
    
     In general: bytes = 8+2N  : 10,12,14,16, 18, 20   (N = 1 to 6)
                 cycles= 28+22N: 50,72,94,116,138,160
    
    Form 2:
     Instructions            Bytes  Cycles  
     ---------------         -----  ------    
     ai r10, -regcount*2      4      14+0
     mov r0, @0(r10)          4      14+8
     mov r1, @2(r10)          4      14+8
     ...
    
     In general: bytes = 4+4N  :  8,12,16,20, 24, 28   (N = 1 to 6)
                 cycles= 14+22N: 36,58,80,102,124,146
    

     

    Form one is slightly slower, but results in smaller code when three or more registers are to be pushed. For the TMS9900 port, form two is only used when two or fewer registers are to be pushed.
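
    The trade-off follows directly from the two general formulas. A minimal sketch, with hypothetical function names, that encodes them along with the selection rule stated above:

```c
/* Cost of pushing n registers with form one (post-increment moves). */
int push_form1_bytes(int n)  { return 8 + 2 * n; }
int push_form1_cycles(int n) { return 28 + 22 * n; }

/* Cost of pushing n registers with form two (indexed moves). */
int push_form2_bytes(int n)  { return 4 + 4 * n; }
int push_form2_cycles(int n) { return 14 + 22 * n; }

/* Selection rule from the text: form two for two or fewer registers,
   form one otherwise. Returns 1 or 2. */
int push_form_for(int n)
{
    return n <= 2 ? 2 : 1;
}
```

    The two forms are the same size at two registers (12 bytes each), form one is smaller from three registers upward, and form one always costs 14 cycles more.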

     

     

    Saving Non-Volatiles

    --------------------

    Non-volatile registers are saved by the callee as part of the function preamble. The saved value is then restored to the register before the function returns.

     

     

    Local variables

    ---------------

    Local variables are stored on the stack after the stored non-volatile registers.

     

     

    Call convention

    ===============

     

    Arguments In Registers

    ----------------------

    In order to make the call overhead as low as possible, R1 through R6 are used to pass arguments to called functions. Since two registers are used for 32-bit values, this can result in three to six arguments that can be passed on registers. All additional arguments are passed on the stack.
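
    The register assignment can be sketched as a simple left-to-right walk over the argument sizes. The names are hypothetical, and one detail is an assumption not stated above: once an argument fails to fit, this sketch puts it and every later argument on the stack.

```c
#define FIRST_ARG_REG 1
#define LAST_ARG_REG  6
#define ON_STACK      0   /* marker: argument is passed on the stack */

/* sizes_in_words[i] is 1 for a 16-bit argument and 2 for a 32-bit one.
   Returns the first register holding argument `which`, or ON_STACK. */
int arg_first_reg(const int *sizes_in_words, int which)
{
    int next = FIRST_ARG_REG;

    for (int i = 0; i <= which; i++) {
        int last = next + sizes_in_words[i] - 1;
        if (last > LAST_ARG_REG)
            return ON_STACK;    /* assumption: the rest are spilled too */
        if (i == which)
            return next;
        next += sizes_in_words[i];
    }
    return ON_STACK;            /* only reached for a negative `which` */
}
```

    For four 32-bit arguments this gives R1, R3 and R5 for the first three and the stack for the fourth.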

     

     

    Arguments On The Stack

    ----------------------

    Arguments which cannot be passed by register are pushed onto the stack before the function is called.

     

     

    Saved Registers

    ---------------

    Register values which are to be saved are pushed onto the stack after the arguments.

     

    Only R9, R11, R13, R14 and R15 are eligible to be saved. R10 (the stack pointer) is not itself saved. This is done to save stack space. Each function knows what it placed on the stack, and can unwind the stack for itself. This also saves the code that would be needed to explicitly save and restore the stack pointer.

     

    A consequence of this design is that an external debugger cannot determine a call tree by examining the stack. Since no debuggers are in common use for this architecture, this is not seen as a major drawback.

     

     

    Local Variables

    ---------------

    If local variables are used in a function, volatile registers are first allocated for the task. Once those are depleted, non-volatile registers are used. If those are depleted as well, the stack is used. Local variables which are referenced by address are always stored on the stack.

     

    If the stack is used for local variables, they are pushed onto the stack after the saved registers.

     

     

    Return Values

    -------------

    Sixteen-bit values are returned in register R1. Thirty-two-bit values are returned in registers R1 and R2.

  12. Thanks for the support everyone. Work has been eating all my time lately, so I haven't had a chance to post anything.

     

    I've been working on notes for the GCC port, and found a few minor bugs, so no patches yet. Sorry.

     

    I do however have some notes and more sample code that people have been asking for.

     

    Primitive Data Types

    ====================

    There are four primitive data types supported by the TMS9900 patch. These types may be stored in registers. Larger data types must be stored in memory, and accessed via pointer dereference.

     

     Name   Size in bytes
     -----  -------------
     char   1
     int    2
     short  2
     long   4
    

     

    Byte quantities are stored in the high byte of a register to accommodate the byte-oriented instructions. This also allows tests on signed quantities to behave as expected. Conversion to and from larger types should be rare, so it is best to take advantage of the mechanisms the hardware provides.

     

    Two byte quantities are stored in a single register. This is the most convenient data type to use, as the word-oriented instructions like INC, INCT, DEC, or DECT may be used to improve size and performance of compiled code. The "int" type is intended to be the same size as the machine word, so it is defined as a two-byte quantity.

     

    Four byte quantities are stored in consecutive registers in big-endian format. The lower numbered register contains the most significant word, and the higher numbered register contains the least significant word.
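As an illustration of these layouts, the following C sketch models a register as a uint16_t. The helper names are invented for this example and do not appear in the patch; it only mirrors the layout described above.

```c
#include <stdint.h>

/* A char lives in the high byte of a 16-bit register; the low byte is unused. */
static uint16_t reg_from_char(uint8_t value)
{
    return (uint16_t)(value << 8);
}

static uint8_t char_from_reg(uint16_t reg)
{
    return (uint8_t)(reg >> 8);
}

/* With the byte in the high half, the register's sign bit is the byte's sign bit. */
static int reg_is_negative(uint16_t reg)
{
    return (reg & 0x8000u) != 0;
}

/* A long occupies two consecutive registers, most significant word first. */
struct reg_pair {
    uint16_t low_numbered;   /* e.g. R1: most significant word  */
    uint16_t high_numbered;  /* e.g. R2: least significant word */
};

static struct reg_pair regs_from_long(uint32_t value)
{
    struct reg_pair p;
    p.low_numbered  = (uint16_t)(value >> 16);
    p.high_numbered = (uint16_t)(value & 0xFFFFu);
    return p;
}

static uint32_t long_from_regs(struct reg_pair p)
{
    return ((uint32_t)p.low_numbered << 16) | p.high_numbered;
}
```

This high-byte placement is also why a byte constant such as 0x40 shows up in compiled output as a word load of 64 * 256.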

     

    It is important to realize that not all 32-bit operations are supported by the compiler. This is intentional. Some 32-bit operations, like division, are quite involved on the TMS9900, and if supported by the compiler would result in large pieces of inlined code. It is more efficient for these operations to be handled by an external library. Only simple operations, like addition, subtraction and conditional checks are inlined by the compiler.
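For a feel of what such a library routine could look like, here is a plain shift-and-subtract 32-bit unsigned divide in C. This is only a sketch: the name udiv32 and its signature are hypothetical, and it is not the routine any particular support library for this port actually provides.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical library helper: 32-bit unsigned division by
 * shift-and-subtract, one quotient bit per loop iteration.
 * If remainder is non-null, the remainder is stored through it.
 */
static uint32_t udiv32(uint32_t dividend, uint32_t divisor, uint32_t *remainder)
{
    assert(divisor != 0);

    uint32_t quotient = 0;
    uint32_t rem = 0;

    for (int bit = 31; bit >= 0; bit--) {
        /* Bring down the next dividend bit, most significant first. */
        rem = (rem << 1) | ((dividend >> bit) & 1u);
        if (rem >= divisor) {
            rem -= divisor;
            quotient |= (uint32_t)1 << bit;
        }
    }

    if (remainder != 0)
        *remainder = rem;
    return quotient;
}
```

Thirty-two iterations of shift, compare and subtract across register pairs is a lot of code to inline at every division, which is the argument for keeping it in one shared routine.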

     

    Register Usage

    ==============

    The registers are primarily separated into volatile and non-volatile groups. The volatile registers are not preserved when a function is called, and values stored there may be destroyed as a result. Non-volatile registers are preserved across function calls. Care must be taken to preserve and restore the values stored in non-volatile registers if they are to be used.

     

    A certain number of registers are assigned special functions by the hardware. The volatility of these registers has been chosen to reflect the most convenient usage model.

     

    Since the TMS9900 only supports two-argument instructions, most interesting calculations will involve many intermediate registers. This has resulted in the decision to classify a large number of registers as volatile. Additionally, since the registers live in memory, non-volatile values can usually be stored in memory without much penalty.

     

    In order to reduce the overhead of a function call, it is desirable that return values and as many arguments as possible are passed by register. These argument registers must be contiguous so that 32-bit arguments can be used.

     

    Finally, to make the volatility of each register easier to remember, the registers should split cleanly in two, with all volatile registers on one side of a dividing line and all non-volatile registers on the other.

     

    The register convention has been chosen to try to find a best fit for these constraints and desires.

     

    R0 - Volatile, Bit shift count

    R1 - Volatile, Argument 1, return value 1

    R2 - Volatile, Argument 2, return value 2

    R3 - Volatile, Argument 3

    R4 - Volatile, Argument 4

    R5 - Volatile, Argument 5

    R6 - Volatile, Argument 6

    R7 - Volatile, Argument pointer

    R8 - Volatile, Frame pointer

    R9 - Preserved across BL calls

    R10 (SP) - Preserved across BL calls, Stack pointer

    R11 (LR) - Preserved across BL calls, Return address after BL

    R12 (CB) - Volatile, CRU base

    R13 (LW) - Preserved across BL calls, Old workspace register after BLWP

    R14 (LP) - Preserved across BL calls, Old program counter after BLWP

    R15 (LS) - Preserved across BL calls, Old status register after BLWP

     

    16-bit values are returned in R1, 32-bit values are returned in R1,R2.

     

    Up to six arguments may be passed by register using R1 through R6. If not used for arguments, these are available as volatile registers.

     

    The argument pointer in R7 is not always used. If used, it points to the start of arguments passed on the stack. However, the compiler can usually calculate the location relative to the stack pointer, freeing R7 for general use.

     

    The frame pointer in R8 is not always used. If used, it points to the start of local values stored on the stack. However, the compiler can usually calculate the location relative to the stack pointer, freeing R8 for general use.

     

    The stack pointer is stored in R10. By definition, this is a non-volatile register. Called functions, if they manipulate the stack, must restore the stack pointer to its original value before they return.

     

    The remaining registers all have special uses defined by the hardware.

     

    R0 is used by the shift instructions, which may be executed at any depth in the call tree. It must be volatile.

     

    The CRU base register is assigned to R12. This is volatile since all CRU uses will have to be in an assembly module. Since CRU operations are awkward enough as it is, why compound the confusion by bringing the stack into the picture?

     

    R11, R13, R14 and R15 store values required to return from a BL or BLWP call. These values must be preserved across function calls. If not used for a return from BL or BLWP, these are treated as non-volatile registers.
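The register list above can be boiled down to a lookup table. The C sketch below is only a model of the convention, with made-up names; the rule that R10 is preserved but never pushed follows the earlier note that only R9, R11, R13, R14 and R15 are eligible to be saved.

```c
#include <assert.h>
#include <stdbool.h>

/* Volatility per register, indexed by register number (R0..R15). */
static const bool reg_volatile[16] = {
    true,                                /* R0      shift count           */
    true, true, true, true, true, true,  /* R1-R6   arguments             */
    true,                                /* R7      argument pointer      */
    true,                                /* R8      frame pointer         */
    false,                               /* R9      preserved             */
    false,                               /* R10     stack pointer         */
    false,                               /* R11     BL return address     */
    true,                                /* R12     CRU base              */
    false, false, false                  /* R13-R15 BLWP return context   */
};

static bool is_volatile(int reg)
{
    assert(reg >= 0 && reg <= 15);
    return reg_volatile[reg];
}

/* R10 is preserved by unwinding the stack, not by pushing it. */
static bool needs_save(int reg)
{
    return !is_volatile(reg) && reg != 10;
}

/* Count the registers a callee would have to save, given a bitmask
 * (bit N set means the function modifies register N). */
static int count_saves(unsigned modified_mask)
{
    int count = 0;
    for (int reg = 0; reg <= 15; reg++) {
        if ((modified_mask & (1u << reg)) && needs_save(reg))
            count++;
    }
    return count;
}
```

A function that touches R1, R9 and R10 would only need to save R9 under this model.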

  13. Newbie mistake here. During the course of compiling build instructions, I realized that the patch files included above are no good.

     

    I mixed source and destination while running the diff command. Sorry.

     

    But the complete files are still valid. I'll fix this ASAP, and include patch and build instructions, as well as register usage and other helpful info.

     

    While I'm at it, I'll convert to hex constants in the assembly.

  14. Absolutely brilliant ! Marvellous !

     

    Please use equates or something in assembler code to make it a bit more readable. Like having @VDPWA instead of @-29694.

     

    :thumbsup:

     

    That would be tricky. By the time GCC gets to assembly output, all constant expressions are calculated out to just a number. In order to use equates, a post-processing stage would need to be run to collect commonly-used values. Tricky and potentially misleading if two constant values have an equal value, but are used for different reasons. On the other hand, hex constants would probably be easier to follow than the decimal ones which are used now.
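For the hex idea, the conversion itself is simple. A minimal C sketch, assuming the usual TI ">" prefix for hex constants (the function name is made up, and this is not code from the patch):

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format a 16-bit constant as TI-style hex.
 * Negative values are shown as their two's-complement word,
 * so -29694 becomes ">8C02".
 */
static void format_ti_hex(int value, char *out, size_t out_size)
{
    assert(value >= -32768 && value <= 65535);
    unsigned word = (unsigned)value & 0xFFFFu;  /* keep the low 16 bits */
    snprintf(out, out_size, ">%04X", word);
}
```

With this, the @-29694 and @-29696 operands in the listing would read as @>8C02 and @>8C00, which match the VDP address and write data ports in the C source.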

  15. I can understand your indecision, but you seem to have a talent for technical writing. Also, consider the number of views of your assembly thread. For each reply, there are almost ten views, which is pretty good considering the size of the TI assembly crowd.

     

    I say, if this is something you would enjoy doing, go for it. Even if there aren't millions of copies waiting to be sold, you would have the satisfaction of pointing at YOUR book on the shelf at the end of the day, and may inspire TI Basic programmers to get to know their machine a bit better.

  16. Hey everybody,

    I've been lurking here for a while, and thought this group might be interested in a port I've made of GCC and the GNU Binutils tools (gas, ar, ld, etc.).

    I've done some testing with simple programs, and things seem stable enough for other people to play with the compiler if they're interested. Since my development machine runs Linux, it seems like the best way to distribute the required patches is via source. I can provide the binaries, but since it seems like everybody here runs Windows, that might not be too useful. But if I'm wrong, let me know.

    The compiler outputs are ELF format object files, so I've made tools to convert to cartridge and EA5 binary formats for the TI. The intermediary assembly files are TI compatible, so they could be fed into your assembler of choice if you prefer. GAS also uses TI compatible files, but provides some useful extensions (long label names being one of the more important ones).

    I'm still putting a development site together at insomnialabs.blogspot.com, which has links to the GCC and Binutils patches, but other tools I have are not yet available for download (I'm working on it..).

    I'm attaching a "hello world" program as a teaser using GCC, GAS, LD, and the elf-to-cart converter.

    So, example code (included in the attachment):

    
    /* Memory-mapped VDP ports */
    #define VDP_READ_DATA_REG   (*(volatile char*)0x8800)
    #define VDP_WRITE_DATA_REG  (*(volatile char*)0x8C00)
    #define VDP_ADDRESS_REG     (*(volatile char*)0x8C02)

    #define VDP_READ_FLAG   0x00
    #define VDP_WRITE_FLAG  0x40
    #define VDP_REG_FLAG    0x80

    /* Copy size bytes from CPU memory at src to VDP memory at index */
    static void vdp_copy_from_sys(int index, char* src, int size)
    {
      volatile char* end = src + size;
      VDP_ADDRESS_REG = index | VDP_WRITE_FLAG;
      VDP_ADDRESS_REG = (char)(index >> 8);

      while(src != end)
        VDP_WRITE_DATA_REG = *src++;
    }

    void main()
    {
      // 12345678901234
      vdp_copy_from_sys(0, "HELLO WORLD!",12);
      vdp_copy_from_sys(32, "THIS IS LINE 2", 14);
      while(1);
    }
    
    

    The resulting code after -O2 optimization:

    
            pseg
    LC0
            text "HELLO WORLD!"
            byte 0
            even
    LC1
            text "THIS IS LINE 2"
            byte 0
            even

            def main
    main
            li r1, 64 * 256
            movb r1, @-29694
            clr r1
            movb r1, @-29694
            li r1, LC0
    L2
            movb *r1+, r2
            movb r2, @-29696
            ci r1, LC0+12
            jne L2
            li r1, 96 * 256
            movb r1, @-29694
            clr r1
            movb r1, @-29694
            li r1, LC1
    L3
            movb *r1+, r2
            movb r2, @-29696
            ci r1, LC1+14
            jne L3
    L8
            jmp L8
    
    

    I was originally doing the port as a convenience for myself, so things may not be as well documented as they should be. Let me know what you think.

     

     

    --------------------------------

    Updated Aug 18, 2014:

     

    The build information and other handy info was scattered throughout this thread, making it hard to get the latest stuff. I'll make sure to keep this post updated to make things easier to find.

     

    Manually building Binutils:

    (from top level of source tree)

    $ patch -p1 < BINUTILS_PATCHFILE

    $ ./configure --target tms9900 --prefix INSTALL_DIRECTORY --disable-build-warnings
    $ make all
    $ make install

    Manually building GCC:

    (from top level of source tree)

    $ patch -p1 < GCC_PATCHFILE

    $ mkdir build
    $ cd build
    $ ../configure --prefix INSTALL_DIRECTORY --target=tms9900 --enable-languages=c
    $ make all-gcc all-target-libgcc
    $ make install

    Building Binutils and GCC using install script:

    $ install.sh INSTALL_DIRECTORY

     

     

    Binutils Changelog

    -------------------------

    1.5
    Released 2013-05-01
    Added more informative syntax error messages
    Fixed values like ">6000" in strings being mangled
    Confirm support for named sections

    1.6
    Released 2014-10-10
    Added support for numeric registers
    Correct handling of comments
    Added support for dwarf debugging information

    1.7

    Released 2014-12-04

    Restored ability to have label and code on same line
    Minor code cleanup

     

    GCC Changelog

    -----------------------

    1.8
    Released 2013-05-01
    Fixed R11 restoration in epilogue being dropped by DCE
    Added support for named sections
    Removed support for directly zeroing byte memory, was buggy in some memories

    1.9
    Released 2014-10-10
    Changed order of jumps for less-than-or-equal tests to improve performance
    Fixed several integer type conversion bugs
    Corrected handling of variable shift by zero bits
    Fixed signed division
    Added support for dwarf debugging information

     

    1.10

    Released 2014-12-04

    Prevented use of R0 as an address base
    Moved jump tables into text segment to free up space for variables
    Fixed bug which put initialized data in bss section
    Fixed negation of byte quantities
    Minor code cleanup

    1.11
    Released 2015-06-14
    Fixed compilation error due to missing FILE macro in tms9900.h
    Some instruction sizes were defined incorrectly, causing assembly errors
    Fixed conditional jump displacement limits, they were too small.
    Added compilation pass to add needed SWPB instructions.

    1.12
    Released 2015-08-16
    Fixed bug when dividing by constant value
    Improved type testing for instruction arguments
    Added text to "--version" flag output to show patch version.

    1.13

    Released 2016-11-23
    Added compilation pass to better use post-increment addressing
    Ensured word alignment for symbols
    Removed optimization of tests against zero, they emitted unnecessary opcodes
    Fixed 32-bit shift instructions
    Fixed shift instructions to handle shift by zero bits
    Fixed AND instruction to use ANDI when appropriate
    Added optimizations for shift of 32-bit value by 16 bits
    Fixed multiply to prevent using MPY with an immediate operand

    1.14

    Released 2017-02-19

    Added tail call optimization

    Confirmed C++ support

     

    1.15

    Released 2017-05-29
    Added .size directive to compiled output
    Fixed several instruction lengths
    Fixed multiply bug
    Reduced patch size

     

    1.16
    Released 2017-08-24
    Fixed 32-bit right constant shift, failed with some constants
    Fixed all 32-bit variable shifts, sometimes used r0 as temp register
    Fixed carry bit in 32-bit add
    Fixed invalid instruction in some 32-bit add forms

    1.17
    Released 2018-10-25
    More strict checks for address in BL commands
    Optimization for 32-bit left shift by 8 bits
    Optimization for 32-bit logical right shift by 8 bits
    Fixed 32-bit right shift by more than 16 bits
    Fixed 8-bit multiplies

     

    1.18
    Released 2018-10-31
    Fixed 16-bit signed right shift
    Fixed 32-bit unsigned right shift

    1.19
    Released 2019-02-26
    Removed side-effects from zero compares

    hello.tar.gz

    gcc-4.4.0-tms9900-1.0-patch.tar.gz

    binutils-2.19.1-tms9900-1.0-patch.tar.gz

    elf2cart.tar.gz

    elf2ea5.tar.gz

    binutils-2.19.1-tms9900-1.7-patch.tar.gz

    gcc-4.4.0-tms9900-1.11-patch.tar.gz

    gcc-4.4.0-tms9900-1.12-patch.tar.gz

    gcc-4.4.0-tms9900-1.13-patch.tar.gz

    gcc-4.4.0-tms9900-1.14-patch.tar.gz

    hello_cpp.tar.gz

    gcc-4.4.0-tms9900-1.15-patch.tar.gz

    hello2.tar.gz

    gcc-4.4.0-tms9900-1.16-patch.tar.gz

    gcc-4.4.0-tms9900-1.17-patch.tar.gz

    gcc-4.4.0-tms9900-1.18-patch.tar.gz

    gcc-4.4.0-tms9900-1.19-patch.tar.gz

    gcc-installer.tar.gz
