Assembly on the 99/4A

sometimes99er · October 27, 2017

Let's assume I have r1 pointing to a byte in memory, and I want to test it for being zero (the byte, that is!). What would be the shortest (fastest) way of doing that?

	movb	*r1,r2		; 22 cycles - requires r2
	movb	*r1,*r1		; 26 "
	cb	*r1,r2		; 22 "      - requires r2 to contain value
	cb	*r1,@h0000	; 38 "      - requires address to contain value
h0000	data	>0000

Way back when I first wrote my TI99/4 assembly program I loved the capitals, right now I write my 9900 programs in lowercase!

Me too.

matthew180 · October 27, 2017

I find that comparing to zero is not something I do specifically, so it really depends on the circumstances. For example, decrementing for a loop, the compare to zero is done for you:

  LI R2,32
LOOP:
  * do stuff
  DEC R2
  JNE LOOP

For your example where you are using a register as a pointer, and maybe dealing with a null terminated string, you could work the test into the loop itself and avoid a specific test for zero. Something like this maybe:

TEXT DB  "HELLO",0
BUFF BSS 40

  LI R1,TEXT
  LI R2,BUFF
LOOP:
  MOVB *R1+,*R2+
  JNE  LOOP

Of course this assumes the destination is expecting that final 0 (zero) byte. But this is just an example to demonstrate the idea that you can probably build the test into your loop without having to perform a specific compare for zero. If you do need to compare specifically, the summary in sometimes99er's post above shows the fastest options. Also keep in mind that the 9900 uses a RAM-based register-file, so even something like MOV *R1,*R2 is going to cause memory access (I get the feeling you know this).
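
When the bytes should not be copied anywhere, the same idea works with a scratch register as the target. A minimal sketch, assuming R2 is free as scratch and R3 as a counter (both register choices are assumptions, not from the post), that measures the length of the zero-terminated string at TEXT:

```asm
* Count the bytes before the terminating zero
       LI   R1,TEXT      ; R1 -> first byte of the string
       CLR  R3           ; R3 = length so far
NEXT   MOVB *R1+,R2      ; byte goes to the high byte of R2; EQ is set if it is zero
       JEQ  DONE         ; terminator reached, no separate compare needed
       INC  R3
       JMP  NEXT
DONE   EQU  $            ; R3 now holds the length
```

The MOVB does the zero test as a side effect, at the cost of occupying one register.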

I think the 9900 was heavily influenced by the PDP-11.

Edited October 27, 2017 by matthew180

+mizapf · October 27, 2017

Furthermore (r1) means *R1, just a different notation.

Way back when I first wrote my TI99/4 assembly program I loved the capitals, right now I write my 9900 programs in lowercase!

which means the standard assembler will run into trouble, won't it?

Although I'd prefer lowercase from all other programming languages, I'm using capitals for TMS9900 assembler because the language is defined by capitals in the manual.

It's Sparky · October 27, 2017

which means the standard assembler will run into trouble, won't it?

Although I'd prefer lowercase from all other programming languages, I'm using capitals for TMS9900 assembler because the language is defined by capitals in the manual.

Yes, I think so. That is why I use my own assembler! I needed an assembler/linker/loader toolset that is able to do split I/D for a future 99105 project (this will increase your effective memory to 128K, since instructions and data are separated).

While creating it, I added some nice features, like local (reusable) labels to prevent name space cluttering:

    li r1,buffer
1:  movb *r1+,r2
    jne 1b

Automatic byte and word literal generation (done by the loader in text or data segment):

space  equ   ' '
...
       cb    *r1,=b(space)

Literals with the same value will be mapped to the same address.

Automatic long/short jump expansion:

    c r1,r2
    bjne equal

This will automatically expand to a (short) jne, or a skipping jeq and a branch if the destination label is out of reach.
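
Written out by hand in standard syntax, the two expansions would look roughly like this. It is only a sketch: NEAR, FAR and SKIP are placeholder labels, and the exact code the assembler emits is an assumption based on the description above.

```asm
* Short form: target lies within -128..+127 words of the jump
       C    R1,R2
       JNE  NEAR

* Long form: invert the condition and hop over an unrestricted branch
       C    R1,R2
       JEQ  SKIP         ; equal: do not take the branch
       B    @FAR         ; not equal: branch to any address
SKIP   EQU  $
```

The long form costs three words instead of one, which is why it pays to let the assembler pick.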

My loader generates mini-memory.ram files that are compatible with ti99sim, so I can directly run my programs in the simulator.

+OLD CS1 · October 27, 2017

which means the standard assembler will run into trouble, won't it?

Although I'd prefer lowercase from all other programming languages, I'm using capitals for TMS9900 assembler because the language is defined by capitals in the manual.

I ran into this when programming 6502. Some sources are lower-case and some are upper. What I tend to do, and this has migrated back into my 9900 programming, is use a different case dependent upon the situation. For instance, if I am re-using code -- my own or someone else's -- I will preserve that case and use the opposite for whatever changes or additions I make. In 9900, I will use a different comment delimiter, as well, such that if the source I am using has asterisks for comments I will use semicolons and vice-versa.

It's Sparky · October 27, 2017

...

I think the 9900 was heavily influenced by the PDP-11.

Yes, I think you are right. The increase in the number of registers (from 8 on the PDP-11 to 16 on the 9900) costs a lot of instruction space. So the TI developers dropped the auto-decrement indirect addressing mode and the nice orthogonal immediate addressing mode (and created separate instructions for that). Maybe the tst(b) instruction was also deleted to make things fit. Still I feel uncomfortable to use an instruction that writes back to memory without a real reason. Maybe that is just me!

Edited October 27, 2017 by It's Sparky

Asmusr · October 27, 2017

which means the standard assembler will run into trouble, won't it?

Although I'd prefer lowercase from all other programming languages, I'm using capitals for TMS9900 assembler because the language is defined by capitals in the manual.

I have faithfully been using upper case and short labels for years, but with the Knight Lore project I finally changed to lower case and long labels. The code is now easier to type, and long labels make it easier to structure and understand. Even when I used upper case I always used cross-platform assemblers: Asm994A to begin with and later xas99. I have always used conditional assembly instructions, e.g. ifdef, so the code has never compiled on E/A anyway.

matthew180 · October 27, 2017

... Still I feel uncomfortable to use an instruction that writes back to memory without a real reason. Maybe that is just me!

Well, sadly, you will get that a lot on the 9900. Also, the whole read-before-write just because TI left out the upper-byte / lower-byte control pins.

There are two instructions that don't write back to memory, yet can be used to test for 0:

ABS

CB

The problem with CB is that it requires a source to compare against, which means at the very least another memory read. However, if you dedicate two registers to always be 0 and 1, then it is probably one of the fastest methods.

The ABS instruction skips the write to memory if the MS-bit of the original value was already zero. However, ABS compares the whole 16-bits, so the LSB would already have to be zero, and the MSB would have to be between 0 and 127 to avoid the write to memory. Finally, ABS compares the value "before" it is converted to a positive value.
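
A minimal sketch of the CB variant with a dedicated zero register. R10 as the constant-zero register and the label ISZERO are assumptions made for illustration:

```asm
* One-time setup: reserve R10 as a constant zero
       CLR  R10
* ...
* Test the byte R1 points to without writing anything back
       CB   *R1,R10      ; compare the byte at *R1 with the high byte of R10
       JEQ  ISZERO       ; taken when the byte is zero
```

CB only sets status bits, so neither the byte nor R10 is modified; the price is one register permanently tied up.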

matthew180 · October 27, 2017

... I have always used conditional assembly instructions, e.g. ifdef, so the code has never compiled on E/A anyway.

I quit worrying about E/A compatibility a long time ago. It was fine BITD, but the pain of writing code on the console and the confines of the E/A are nothing I miss or care to relive.

It's Sparky · October 28, 2017

Hope you guys don't mind digging up an old but interesting post by matthew.

...
However, that is not what happens. For the B instruction, indirect addressing will use the *value* of a register as the memory address to branch to. Lottrup's book actually has the best explanation I could find:

"The line B *R11, which returned several example programs to EASY BUG, meant to branch to the memory location addressed by the value in register 11."

In other words, use the *value* of R11 *as* the address to branch to, and not as the address of where to look for an address to branch to.

Just as matthew describes, the meaning of the classic B *R11 instruction is counterintuitive. Funny detail: B R11 is a completely valid instruction which will 'jump into your workspace, executing your registers (starting at R11)', which obviously has limited usability. Maybe the TI developers realised the need for a 'real' indirect jump when they created the BIND (Branch INDirect) instruction on the 99000. This powerful instruction can be used to jump to a routine from a jump table (using indexed addressing) or even to return from a subroutine call where the return address is on a 'stack': BIND *R10+
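
For comparison, a jump-table dispatch on a plain 9900 needs two steps, because B cannot fetch its target through memory in one go. A sketch, with TABLE, ROUT0..ROUT2 and the scratch register R2 as assumed names:

```asm
* R1 = routine number * 2 (byte offset into the table of words)
       MOV  @TABLE(R1),R2  ; fetch the routine address from the table
       B    *R2            ; branch to the address now held in R2
* ...
TABLE  DATA ROUT0,ROUT1,ROUT2
```

Note that R0 cannot serve as the index register in indexed addressing, so the offset has to live in R1 or higher.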

The introduction of this BIND instruction was paired with the BLSK (Branch and Link StacK) instruction. BLSK R10 is comparable with the standard BL instruction, but instead of storing the return address in R11, it will be stored in the address pointed by R10 (the 'stack' pointer) that will be pre-decremented by 2 (so before the storing is done). A perfect pair of instructions when you want to implement stack-like behaviour.
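
On a plain 9900 the same stack behaviour can be approximated with a few extra instructions. This is a sketch only, assuming R10 is used as a downward-growing stack pointer and SUB is a placeholder routine name:

```asm
* Call as usual; the return address lands in R11
       BL   @SUB
* ...
SUB    DECT R10            ; pre-decrement the stack pointer by 2
       MOV  R11,*R10       ; push the return address
*      ... body, free to use BL itself ...
       MOV  *R10+,R11      ; pop the return address
       B    *R11           ; return
```

The two-instruction push and the pop-plus-branch are what BLSK and BIND *R10+ fold into single instructions on the 99000.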

Maybe it is a cool idea to ramble about enhancing the 9900. Which instructions would you like to see? Maybe a super fast register set on chip? Of course it depends on your taste and your way of programming. Still love the original instruction set, don't get me wrong!

Franc

It's Sparky · November 8, 2017

While writing some code to demonstrate 9900 assembly to my students I ran into the situation where my code needs to be readable and understandable. One of the easiest ways to accomplish this is to use functions (routines/subroutines) to break up and re-use code.

Of course, the use of functions involves a choice in calling conventions. On the 9900 we have several possibilities, for example those based on BLWP/RTWP or BL/B *R11. Apart from calling to and returning from a routine a decision should be made about passing of parameters. Inspired by other architectures (for example Sun/Sparc) I created a calling convention which is easy to use and has a lot of advantages:

· Recursion is possible

· Routine and subroutine pass parameters through registers

· Each incarnation has its own free registers, no more implicit saving/restoring registers

· Each incarnation has a number of scratch registers

· Calling sequence is just 2 words

· No implicit stack administration needed in routines

The basic idea behind this convention is to use overlapping workspaces: the routine and the subroutine share half of their workspace. So, there is a ‘stack of workspaces’ (growing from high addresses to low addresses). Each incarnation will use 16 bytes. Although this seems to be a lot, the total amount of memory used for the system depends on the depth of nested subroutines, which is in most applications really limited. The following table illustrates the idea:

When a function needs temporary register storage it can freely use R5, R6 and R7. However, their content will be destroyed as soon as the function calls another function. Parameters to the function are in R8 up to and including R12. Free (and persistent through calls) registers are R0 up to and including R4 which are also used to pass parameters to a subfunction.

To call a function, I wrote some code that handles all the administration through the use of the XOP instruction. So my assembler will rewrite:

     call routine

to

     xop routine,1

Which takes the same amount of instruction space as BL (2 words)

The routine itself will return to the caller using a standard RTWP instruction. Note that you can manipulate R15 before returning to signify an error condition (for example by setting the parity flag).

The XOP1 handler looks like this (a nice puzzle to see what is going on)

xop1:
      mov     r13,10(r13)
      mov     r14,12(r13)
      mov     r15,14(r13)
      mov     r11,r14
      ai      r13,-16
      rtwp

Funny that the ending rtwp is actually calling the routine!
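
Working through the handler: the callee's workspace starts 16 bytes below the caller's, so the caller's R0-R4 show up as the callee's R8-R12, and the caller's R5-R7 become the callee's R13-R15 (return WP, PC and ST). A minimal usage sketch under those assumptions, written in standard syntax; ADDUP is a hypothetical routine, and the XOP 1 vector is assumed to point at a workspace and at the xop1 handler:

```asm
* Caller: parameters go in R0 and R1, the result comes back in R0
       LI   R0,5
       LI   R1,7
       XOP  @ADDUP,1       ; what "call addup" would assemble to
*      here R0 = 12; R5-R7 were overwritten by the call

* Callee: the caller's R0-R4 are visible here as R8-R12
ADDUP  A    R9,R8          ; caller's R0 = caller's R0 + caller's R1
       RTWP                ; R13-R15 (the caller's R5-R7) hold the return context
```

Because the result is written into R8, it lands directly in the caller's R0 without any copying.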

Wonder what you think of this.

Edited November 8, 2017 by It's Sparky

apersson850 · November 8, 2017

I assume you all know that the TMS 9900 microprocessor actually implements the instruction set of the TI-990/9 minicomputer? The TI-990 was the successor of the TI-980 (now that was unexpected...). Where TI got the inspiration for the changes made between the two I don't know, but they sure changed quite a bit. The TI-980 is a much more conventional CPU design.

The assembler supplied with the p-system for the 99/4A has more functionality than the scaled down TI-990 assembler we know as the E/A package.

Overlapping registers was the parameter passing scheme we selected for the DSK 86000 CPU, designed as a dedicated robotics control CPU back in the 80's.

Edited November 8, 2017 by apersson850

matthew180 · November 12, 2017

...

Wonder what you think of this.

My two cents:

Personally I'm a speed and memory use freak, so to me it seems overly complicated. With a fixed number of possible parameters, it will be overkill for some calls, and not enough for others. I think a valuable lesson you might impart on your students is that the right solution always depends on the system, language, and circumstances. On limited systems you are always close to the hardware and have very limited resources (small amounts of RAM, probably not virtual memory, probably not fast disk storage, etc.), so the solutions used in modern languages like C, C++, Java, etc. with large memory don't always work well on a computer like the 99/4A.

It seems that using the stack for parameters-only would make better use of memory and be more flexible, i.e. if you only need to pass 1 or 2 parameters, then you only use memory for 1 or 2 parameters. IMO all variables should be stored in memory and registers only used for temporary / immediate calculations. Following this idea means you don't have to worry about preserving registers between subroutine calls. Too many times I have seen programs that try to set up registers for specific uses throughout the program, and there is a lot of dancing around to keep registers intact.
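
A sketch of what parameters-on-the-stack could look like on the 9900. R10 as a downward-growing stack pointer, the variable COUNT and the routine ROUTIN are all assumptions for illustration:

```asm
* Caller: push one parameter, then call
       DECT R10
       MOV  @COUNT,*R10    ; push the value of the variable COUNT
       BL   @ROUTIN
* ...
* Callee: pop the parameter into a scratch register
ROUTIN MOV  *R10+,R1       ; R1 = parameter, stack pointer back where it was
*      ... work with R1 ...
       RT
```

Only two bytes of stack are used per parameter, and the callee decides which scratch register to pop into.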

I'm also not a fan of recursion and people try too hard to find ways to use it; just use a loop.

+TheBF · November 12, 2017

...

Funny that the ending rtwp is actually calling the routine!

Wonder what you think of this.

I created a multi-tasking context switch using the RTWP instruction. It's a pretty powerful instruction when used "backwards" this way.

Something that is worth considering for parameter passing is blending a stack with a register or two.

If you assign 1 register to the job of holding the TOP of stack like a little cache, then the stack becomes more efficient.

You can go with 2 cached registers for top and 2nd item, but that can result in more register pushing and popping than it's worth.
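
A sketch of the single cached register idea. R4 as the top-of-stack cache follows the TOS alias used in my Forth code; R6 as the stack pointer is an assumption for illustration:

```asm
* R4 caches the top of stack; R6 points at the second item, in memory
* Push a literal: spill the old top to memory, then load the new one
       DECT R6
       MOV  R4,*R6
       LI   R4,42
* Add the top two items: pop the second item straight into the cached top
       A    *R6+,R4
```

The add takes one instruction because one operand is already in a register and the other is popped by the auto-increment.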

Another method that I have not fully explored would use a stack but then exploit BLWP to move the workspace to the top of the stack space.

This would let you push parameters onto a stack as outputs of a routine for example and then process them with register instructions.

I have not worked out the dynamics of this in detail, but it is something that the 9900 can do. It would require that the stack grows upward I believe so that you can preserve R13,R14,R15 above the stack.

I must confess that I find overlapping register sets more complicated to think about than a stack, but that could be an Intel bias. :-)

Airshack · December 3, 2017

http://ataripodcast.libsyn.com/antic-interview-316b-dave-comstock-part-2

The link above points to an interview with Dave Comstock, an Atari programmer from the earliest Atari days. At 20:20 into this interview he begins talking about a scheme used to speed things up in loading screen images for Ballblazer.

Self-modifying assembly code. I can't quite grasp what he's talking about. I know Rasmus did something Ballblazer-esque with his demo code...and in Skyway?

Wondering if anyone here can give this conversation (a few minutes starting at 20:20) a listen and clarify what this clever coding hack is all about.

I wish I had a diagram of what Dave is pointing out.

PeteE · December 4, 2017

Self-modifying assembly code. I can't quite grasp what he's talking about. I know Rasmus did something Ballblazer-esque with his demo code...and in Skyway?

For obvious reasons, self-modifying code works on code running in RAM, not in ROM. So you have your code in RAM, the code is just words that tell the CPU what to do. If you change the words, the CPU will do other things.
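
A very small illustration of "changing the words", assuming the code sits in RAM and PATCH is a label chosen for this sketch. The second word of an LI instruction is its immediate operand, so storing to that word changes the constant the instruction loads:

```asm
* Overwrite the operand word of the LI below
       MOV  R0,@PATCH+2    ; PATCH+2 is the immediate word of the LI
* ...
PATCH  LI   R1,0           ; at run time this loads whatever R0 held above
```

The same store aimed at PATCH itself would replace the opcode word and turn the LI into a different instruction altogether.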

After listening to the audio, it sounds like ballblazer had a very long unrolled loop of load and store instructions in memory to do a copy. For a TI example, this loop will copy bytes using a MOVB instruction:

; Copy R2 bytes from address R0 to address R1
!  MOVB *R0+,*R1+
   DEC R2
   JNE -!

Unrolled, eliminating the DEC and JNE instructions and repeating the MOVB instruction, would look like this:

UNROLL MOVB *R0+,*R1+
       MOVB *R0+,*R1+
       MOVB *R0+,*R1+
       MOVB *R0+,*R1+
       MOVB *R0+,*R1+
       MOVB *R0+,*R1+
       MOVB *R0+,*R1+
; ... hundreds more?

The self-modifying code comes in when you want to decide how many bytes to copy. If the number of bytes to copy is in R2, then you would change the MOVB instruction at offset 2 * R2 into a B instruction. The word for B *R11 (aka RT) is >045B, so to modify the code you would do something like this:

; Copy R2 bytes from address R0 to address R1
  A R2,R2        ; R2 = R2 * 2
  AI R2,UNROLL   ; address of unrolled loop
  MOV *R2,R6     ; save the old opcode in R6
  LI R5,>045B    ; R5 = opcode for the RT instruction
  MOV R5,*R2     ; self-modify the code
  BL @UNROLL     ; run the ball-blazing-fast unrolled loop
  MOV R6,*R2     ; restore the old opcode

An alternative method that doesn't require self-modifying code would be to have a RT at the end of the unrolled instructions, and instead calculate and jump to the starting address of the MOVB that is R2 instructions before the RT. Perhaps ballblazer may have been using absolute addresses for load and store, so maybe the programmer couldn't do it without using self-modifying code. Either way, a cool trick for getting fastest possible code.
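
A sketch of that alternative, assuming the label UNREND marks the RT at the end of the block and that R2 is never larger than the number of MOVB instructions in it:

```asm
* Copy R2 bytes from address R0 to address R1 without modifying code
       A    R2,R2          ; each MOVB *R0+,*R1+ is one word = 2 bytes
       NEG  R2
       AI   R2,UNREND      ; R2 = UNREND - 2 * count
       BL   *R2            ; enter the block 'count' instructions before the RT
* ...
UNROLL MOVB *R0+,*R1+
       MOVB *R0+,*R1+
*      ... more copies ...
       MOVB *R0+,*R1+
UNREND RT
```

With a count of zero the BL lands on the RT itself and returns immediately. This version also works when the block is in ROM.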

+arcadeshopper · December 5, 2017

Answering rather late, sorry.

I always put my workspace at >8300. No particular reason other than I have always done it that way and it is out of the way of using the rest of the scratch pad RAM for variable storage. IMO you should *always* use a workspace in the 16-bit scratch pad RAM, otherwise you pay a hefty performance penalty. As others have mentioned, if you are going to use console routines (ROM or GROM) or allow the console ISR to run, then you will need to know and respect the use of scratch pad based on those services. Also, if you are interfacing with XB then I think there are additional constraints.

A command module can contain both ROM and GROM.

That would be a good assumption. Secondary assumptions might be cost, component size, or number of pins.

Without bank-switching, yes, you are limited to 8K. With bank-switching you are limited to 8K at *any single time*. Basically bank-switching gives you an 8K window into the larger memory space.

I'm not totally up on my console differences, but as others have said I think carts that do not have GROM do not show up on the menu. Also, I think the QI console removed some of the physical connections to the cartridge port, so ROM is not physically possible. I might be wrong about that though, so check the facts.

No idea. Greg?

_Any_ ROM cartridge needs a 2.2 workaround. The playground loader is in the forums here, or I sell a disk for the cost of the disk that will allow you to execute a ROM on a 2.2 console. Side-port carts work on 2.2 consoles as they don't use the menu at all; they just take over at boot.

Greg

Airshack · December 8, 2017

...

WOW! Thank you PeteE for researching the podcast link and clarifying everything with this well commented native TI assembly coded answer!

I see where the conversation left me and I appreciate the time you put into this complete answer.

+Vorticon · January 10, 2018

Hi.

As I am working on my MM project, I noticed that KSCAN was trashing R6 and R7. Is that normal? There is nothing I can find in the MM or EA manuals about this...

+TheBF · January 10, 2018

Sitting in my forth interpreter I see R6 changing every now and then.

All I do while waiting for a key stroke is call KSCAN waiting for a keypress.

I don't even read the character until a key is pressed.

I see no changes in R7.

For reference here is code in RPN assembler. (TOS is alias for R4. @@ is indirect addressing)

\ Camel99 interface to KSCAN
CODE: (KEY?) ( key-unit -- ?)  \ *WARNING* it takes 1200 uS for key scan to run
             TOS R3 MOV,
             TOS CLR,                   \ TOS will be our true/false flag
             R3 8374 @@ MOVB,           \ set the key-unit# (see TI BASIC Ref. pg II-87)
             0 LIMI,                    \ stop interrupts
             83E0 LWPI,                 \ switch to GPL workspace
             000E @@ BL,                \ call ROM keyboard scanning routine
             8300 LWPI,                 \ return to Forth's workspace 
             2 LIMI,
             837C @@ R0 MOVB,           \ read GPL status byte
             R0 2000 ANDI,              \ mask for key-pressed bit
             @@1 JEQ,
             TOS SETO,                  \ Key pressed: set flag to true
             SCRTO @@ CLR,              \ clr screen timeout
@@1:         NEXT,                      \ return
             END-CODE
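
For readers not used to RPN assembler, roughly the same sequence in conventional syntax. This is a sketch: the labels KEYQ and KEYQ1 are assumptions, and the screen-timeout clear and the Forth NEXT are left out.

```asm
KEYQ   MOV  R4,R3          ; key unit arrives in R4 (TOS)
       CLR  R4             ; R4 becomes the true/false flag
       MOVB R3,@>8374      ; set the key unit
       LIMI 0              ; stop interrupts
       LWPI >83E0          ; switch to the GPL workspace
       BL   @>000E         ; call the ROM keyboard scanning routine
       LWPI >8300          ; back to our own workspace
       LIMI 2
       MOVB @>837C,R0      ; read the GPL status byte
       ANDI R0,>2000       ; mask for the key-pressed bit
       JEQ  KEYQ1          ; no key: flag stays false
       SETO R4             ; key pressed: flag = true
KEYQ1  EQU  $
```

In RPN form the operands simply come first, so "R3 8374 @@ MOVB," reads as MOVB R3,@>8374.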

Edited January 10, 2018 by TheBF

+Vorticon · January 11, 2018

Yup that's what I see too. I was not however aware that KSCAN affected any of the registers...

+Lee Stewart · January 11, 2018

Yup that's what I see too. I was not however aware that KSCAN affected any of the registers...

KSCAN uses pretty much all of the registers (its own). Are you not calling it with the GPL workspace (LWPI >83E0 BL @>000E)?

...lee

+Vorticon · January 11, 2018

I call it using BLWP @>6020 per the MM manual. It should not technically affect any of the user registers but rather use its own workspace... The only way around the issue was to save R6 prior to calling KSCAN, which wastes memory.

+Lee Stewart · January 11, 2018

I call it using BLWP @>6020 per the MM manual. It should not technically affect any of the user registers but rather use its own workspace... The only way around the issue was to save R6 prior to calling KSCAN, which wastes memory.

Yeah...there is nothing there that should change either of the user’s R6 or R7 that I can see. The only other place where something could happen is the ISR, and you probably have interrupts disabled while you call KSCAN. BTW where is your workspace, >8300?

...lee

+Vorticon · January 11, 2018

Actually I removed the LIMI 2 LIMI 0 to save bytes, as they had no effect on KSCAN and did not prevent the user register corruption. I'm using the default MM user WS at >70B8. The MM utilities WS is at >7092. Completely stumped here...
