
Assembly on the 99/4A


matthew180


I see. Yes, my own memory expansion, which can show up anywhere in the address space, is indeed installed by modifications inside the console.

The p-system uses the DSR space itself almost always.

My feeling is that the p-system is easiest to expand with a RAMdisk.

Edited by apersson850

2 hours ago, Vorticon said:

Writing to the SAMS seems slower than reading from it. Is that a thing or something on my end?

Once a page is mapped in to a RAM window it is the same as the RAM that it replaced. Literally on the same card. 

I have never noticed a difference. ( I will double check now) 

Something you might try is to keep a variable of the SAMS page that is mapped in and only call the mapping function if it needs to change.

Sometimes that speeds things up but sometimes the logic takes longer than just doing the mapping. :( 
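A bare-bones sketch of what I mean (CURPG, SAMSREG and the labels are made-up names, not from any SAMS library; >1E00 is the SAMS CRU base and the page number goes in the high byte, as elsewhere in this thread):

* Map the SAMS page in R1 only if it differs from the page already mapped.
* CURPG is a RAM word caching the current page; SAMSREG is the mapper
* register address for the 4K window in use (both names are assumptions).
MAPPG  C    R1,@CURPG        already mapped?
       JEQ  MAPX             yes, skip the CRU work
       MOV  R1,@CURPG        remember the new page
       SWPB R1               page number goes in the high byte
       LI   R12,>1E00        SAMS CRU base
       SBO  0                mapper registers on
       MOV  R1,@SAMSREG      select the page
       SBZ  0                mapper registers off
MAPX   RT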

 


So my quick and dirty result is the opposite. (of course they did) :) 

 

I think it's because when I read SAMS I have to throw away the result with DROP so that's extra time spent in the loop. 

 

The test code reads a 64K chunk of SAMS memory as 32K memory words.

The SAMS code is assembler and the tests are Forth. 

There is computation to convert a virtual address (0..>FFFF) to the real address in RAM so that takes a bit of time. 

The screen shot shows the results.

COM1 - Tera Term VT 2024-03-25 4_44_39 PM.png


In theory, writes to any RAM are slower than reads from it, because most operations do a read-before-write... meaning that every write is actually two operations.

 

It's not possible to slow down a memory access without wait state hardware on the device, which I'm pretty sure the AMS doesn't have. If the memory isn't ready, the CPU will just take whatever's on the bus. ;)

 

(At the lowest level, that is)

 

Edited by Tursi

16 hours ago, hhos said:

Two of these should give some idea of what is needed to add the function you want.

I'm not trained to read schematics, so perhaps you could explain in simple terms what changes it would take to an existing card to support paging in the >6000 - >7fff region? I'm just curious why they chose not to do this in the first instance, since it would be a very useful addition.


9 hours ago, Tursi said:

In theory, writes to any RAM are slower than reads from it, because most operations do a read-before-write... meaning that every write is actually two operations.

 

It's not possible to slow down a memory access without wait state hardware on the device, which I'm pretty sure the AMS doesn't have. If the memory isn't ready, the CPU will just take whatever's on the bus. ;)

 

(At the lowest level, that is)

 

That Tursi is not just a pretty face. :) 

 

I wrote up a better test (I think) that uses MOVE, which is an assembly language word that moves bytes.

 

On real hardware, it takes 4K of data from ROM at >0000 and writes to 64K of SAMS in 4K chunks (16 pages) 

I then did the reverse and wrote the SAMS data back to ROM. 

Reading is much faster. 

 

Edit: Changed WRITEPAGES to do exactly the same thing as READPAGES 

 

NEEDS DUMP FROM DSK1.TOOLS 
NEEDS FORGET FROM DSK1.FORGET 
NEEDS ELAPSE FROM DSK1.ELAPSE 
NEEDS VIRT>REAL FROM DSK1.SAMS

HEX 1000 CONSTANT 4K
    2000 CONSTANT $2000

1 SEGMENT \ uses SAMS pages 16..32 (64K)

DECIMAL 
\ copy 4K in ROM from >0000 to SAMS 
: WRITEPAGES 
    65535 0 
    DO 
       0000  I VIRT>REAL 4K MOVE  
    4K +LOOP 
;

\ copies back to ROM address 
: READPAGES 
    65535 0
    DO 
      I VIRT>REAL  0000 4K MOVE 
    4K +LOOP
;

 

 

 

 

 

COM1 - Tera Term VT 2024-03-26 10_11_52 AM.png


6 hours ago, Asmusr said:

I'm not trained to read schematics, so perhaps you could explain in simple terms what changes it would take to an existing card to support paging in the >6000 - >7fff region? I'm just curious why they chose not to do this in the first instance, since it would be a very useful addition.

 

I have not looked at the SAMS schematics, but I suspect a reason for not paging the >6000 area is to have a stable area for code that does not page (and controls the paging), and an area that does get paged.  But this is really speculation unless someone was there and knows why those design decisions were made.

 

For the technical part, probably more address decoding and new registers to control the new page area.  Again, speculation since I have not looked at the SAMS implementation, but all basic external memory pagers work the same.

 


Posted (edited)
12 hours ago, TheBF said:

I wrote up a better test (I think) that uses MOVE which is an Assembly language word that moves bytes.

 

Your test looks like Forth, not assembly; there is no "MOVE" instruction in assembly.  `MOVB` is the assembly language instruction to move bytes, `MOV` will move a word (16-bit).  `MOV` does not incur the read-before-write penalty, so that would be a better test to compare read vs write speed.

 

Edit: heh, I forgot, the 9900 still does the read-before-write even for 16-bit operations.

Edited by matthew180
technical correction

11 minutes ago, matthew180 said:

 

Your test looks like Forth, not assembly; there is no "MOVE" instruction in assembly.  `MOVB` is the assembly language instruction to move bytes, `MOV` will move a word (16-bit).  `MOV` does not incur the read-before-write penalty, so that would be a better test to compare read vs write speed.

Yes, it's Forth, but MOVE is written in Forth Assembler. Forth is just the console used to call the program and put it in a loop.

CODE MOVE   ( src dst n -- )   \ forward character move
            *SP+ R0  MOV,      \ pop DEST into R0
            *SP+ R1  MOV,      \ pop source into R1
            TOS TOS MOV,
            NE IF,            \ if n=0 we are done
\ need some copies
                R0  R2 MOV, \ dup dest
                R0  R3 MOV, \ dup dest
                TOS R3 ADD, \ R3=dest+n
\ test window:  src  dst dst+n WITHIN
                R0  R3 SUB,
                R1  R2 SUB,
                R3  R2 CMP,
                HI IF, \ do cmove> ............
                
                    TOS W MOV,      \ dup n
                        W DEC,      \ compute n-1
                    W  R1 ADD,      \ point to end of source
                    W  R0 ADD,      \ point to end of destination
                    BEGIN,
                      *R1 *R0 MOVB,
                       R1 DEC,     \ dec source
                       R0 DEC,     \ dec dest
                       TOS DEC,    \ dec the counter in TOS (R4)
                    EQ UNTIL,

                ELSE,  \ do cmove .............
                    BEGIN,
                      *R1+ *R0+ MOVB, \ byte move, with auto increment by 1.
                       TOS DEC,        \ we can test it before the loop starts
                    EQ UNTIL,
                ENDIF,
            ENDIF,
            TOS POP,
            NEXT,
            ENDCODE

 


16 minutes ago, matthew180 said:

 

Your test looks like Forth, not assembly; there is no "MOVE" instruction in assembly.  `MOVB` is the assembly language instruction to move bytes, `MOV` will move a word (16-bit).  `MOV` does not incur the read-before-write penalty, so that would be a better test to compare read vs write speed.

I take your point that MOV would be a better test than MOVB as I am using in my program. 

I can check that too. 


For full disclosure, the VIRT>REAL "program" is written in Forth Assembler also.

(It uses machine code in the system to avoid loading the assembler.)

HEX
CODE VIRT>REAL  ( virtaddr -- real_address )
\ Lee Stewart's code to replace >1000 UM/MOD. 2.3% faster 
     C020 , SEG ,   \ SEG @@ R0 MOV,    \ segment# to R0
     0A40 ,         \ R0 4 SLA,         \ page# segment starts
     C144 ,         \ R4 R5 MOV,        \ address to R5
     0245 , 0FFF ,  \ R5  0FFF ANDI,    \ page offset
     09C4 ,         \ R4 0C SRL,        \ page of current segment
     A100 ,         \ R0 R4 ADD,        \ bank#

     8804 , BANK# , \ R4 BANK# @@ CMP,    \ switch page ?
     1309 ,         \ NE IF,
     C804 , BANK# , \    R4 BANK# @@ MOV, \ YES, update BANK#
     06C4 ,         \    R4 SWPB,
     020C , 1E00 ,  \    R12 1E00 LI,      \ select SAMS
     1D00 ,         \    0 SBO,            \ card on
     C804 , SREG ,  \    R4 SREG @@ MOV,   \ map the page
     1E00 ,         \    0 SBZ,            \ card off
                    \ ENDIF,
     0204 , PMEM ,  \ R4 PMEM LI,         \ page_mem->tos
     A105 ,         \ R5  R4 ADD,         \ add computed offset to page
     
     NEXT, \ 46 bytes
     ENDCODE

 


40 minutes ago, matthew180 said:

 

Your test looks like Forth, not assembly; there is no "MOVE" instruction in assembly.  `MOVB` is the assembly language instruction to move bytes, `MOV` will move a word (16-bit).  `MOV` does not incur the read-before-write penalty, so that would be a better test to compare read vs write speed.

Wow! You are so right @matthew180

I had no idea. 

 

I replaced the MOVE program with MOVEW which uses MOV in a loop to move memory words. 

There is almost no difference in speed now. 

 

For reference here is MOVEW in Forth Assembler. 

\ MOVEW   replaces MOVE16.     Jul 2022 Brian Fox

CODE MOVEW  ( src dst n -- ) \ n= no. of bytes to move
   *SP+ R0 MOV,
   *SP+ R1 MOV,
      BEGIN,
       *R1+ *R0+ MOV,    \ word move, auto increment by 2
       TOS DECT,
   NC UNTIL,
   TOS POP,
   NEXT,
ENDCODE

 

 

COM1 - Tera Term VT 2024-03-26 11_23_05 AM.png


On 3/26/2024 at 2:30 AM, Asmusr said:

I'm not trained to read schematics, so perhaps you could explain in simple terms what changes it would take to an existing card to support paging in the >6000 - >7fff region? I'm just curious why they chose not to do this in the first instance, since it would be a very useful addition.

First, I'll assume "they" refers to the Super AMS design team.🙂  They probably just wanted to use the RAM spaces that had already been allocated, and for good reason I might add.  Whenever you depart from the default memory map you risk having two memories dueling for control of the same data bus.  This is always a bad idea, obviously.  I would bet on all the 0s winning in any data bit conflicts myself.😀

 

In the case of the cartridge port memory that's probably not a big deal.  Your software can easily test for the presence of a cartridge before turning on the switch that makes the RAM mapped to 6000-7FFF available.  Someone plugging in a cartridge while the RAM is available will just cause a reset of the system, so very little harm done?
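A rough sketch of that test (>AA is the standard cartridge header signature byte at >6000; RAMCRU and the bit number are placeholders for however the card's switch would actually be wired):

* Switch RAM into >6000->7FFF only if no cartridge header answers.
CHKCRT MOVB @>6000,R1        first byte of the cartridge ROM space
       SRL  R1,8             move it to the low byte, clear the high byte
       CI   R1,>00AA         standard header signature present?
       JEQ  NOSWCH           a cartridge is in: leave the RAM switched off
       LI   R12,RAMCRU       CRU base of the hypothetical >6000 RAM enable
       SBO  2                turn the RAM on (bit number is assumed as well)
NOSWCH RT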

 

This is not so with the 0000, 4000, and 8000 banks.  The 0000-1FFF space has the system ROM there until it is disabled (requires console modification).  Once the system ROM is locked out the majority of conflicts would probably be related to the GPL interpreter being unavailable?  It's something to plan for if you're going to do it.  I believe there is someone who has his console modified so he can have 64K of RAM mapped in all inside the console.  Apersson850 is the one that comes to mind, but I'm not absolutely certain on that.  Anyway, there are far better sources than myself on what the loss of the system ROM would mean.

 

The 4000-5FFF space is a bit more complicated.  If you have this activated and then find you need to access the floppy drive or the RS232C port then you are back to the dueling data scenario.  The bit that activates the 4000-5FFF region on my design is bit5 with the CRU base address at >5FC0.  To shut it down:

 

LI  12,>1E00      the SAMS doesn't have a plan for this AFAIK
SBZ 5

 

Then you can do your call to the DSR of your choice.
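Put together, the round trip would look something like this (a sketch only, using the same CRU base and bit as above; the DSR call itself is whatever link routine you normally use):

       LI   R12,>1E00        same CRU base as above
       SBZ  5                RAM out of >4000->5FFF, DSR ROMs visible again
*      ... make your DSR call here ...
       SBO  5                map the RAM back in when the DSR is done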

 

In short, to answer your question, it's a pain to use these areas for extra RAM, but it might be worth it.  They decided it wasn't worth the trouble.  You decide if it's worth it for you.😊

 

HH

Edited by hhos
Wrong address for CRU. And then the wrong instruction.

6 hours ago, matthew180 said:

`MOV` does not incur the read-before-write penalty, so that would be a better test to compare read vs write speed.

Yes, it does. The instructions that have byte variants all read before write, just because that was easiest when the logic in the CPU was designed. But a Load Immediate doesn't read the register first, before it stores a value there, as there is no Load Immediate Byte version.

 

If you check the number of memory accesses by MOV and MOVB you'll see that they are the same, four. Read instruction, read destination, read source, write destination.
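Written out for a register-to-register move (my annotation, just restating the count above; workspace registers live in RAM, so they count as memory accesses):

* The four memory accesses of  MOV R1,R2:
*   - fetch the instruction word
*   - read the source word (R1, at WP+2)
*   - read the destination word (R2, at WP+4)   <- the read-before-write
*   - write the result back to the destination word
* MOVB R1,R2 makes exactly the same four accesses; only one byte of the word changes.
       MOV  R1,R2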


12 minutes ago, hhos said:

I believe there is someone who has his console modified so he can have 64K of RAM mapped in all inside the console.  Apersson850 is the one that comes to mind, but I'm not absolutely certain on that.  Anyway, there are far better sources than myself on what the loss of the system ROM would mean.

 

The 4000-5FFF space is a bit more complicated.  If you have this activated and then find you need to access the floppy drive or the RS232C port then you are back to the dueling data scenario.

Yes, I have such a design in my console. The 64 K RAM can be mapped over all 8 K segments in the console, one at a time. If you map out the console ROM you have to map it back in before returning from your assembly program, or you die... But you can copy all of the ROM to RAM and then, for example, modify the interrupt vectors if you want to do something with the TMS 9901.

It works fine to use the RAM over the console ROM for some buffer storage. Enable, store, disable or enable, read, disable.
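As a sketch of that enable/store/disable pattern (CONMAP and the CRU bit are placeholders, since the real switch is specific to the modified console):

       LIMI 0                no interrupts while the ROM vectors are gone
       LI   R12,CONMAP       CRU base of the console mod (made-up name)
       SBO  0                RAM now overlays >0000->1FFF
       MOV  R1,@>0100        store a word in the buffer
       SBZ  0                console ROM back in place
       LIMI 2                restore the usual interrupt mask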

 

The DSR space complicates things further if you, like @Vorticon, want to use SAMS together with Pascal, as the p-code card occupies this space too.


5 hours ago, apersson850 said:

Yes, it does. The instructions that have byte variants all read before write, just because that was easiest when the logic in the CPU was designed.

 

Sure enough, I forgot about that.  The read-before-write info is not in the datasheet though (that I could find), and very easy to forget about.  I only ever saw it mentioned as a note in the System Design book TI published.

 

11 hours ago, TheBF said:

Wow! You are so right @matthew180

I had no idea. 

 

So, I was mistaken.  There must be something else going on with your tests.


7 hours ago, matthew180 said:

The read-before-write info is not in the datasheet though (that I could find), and very easy to forget about.  I only ever saw it mentioned as a note in the System Design book TI published.

Oddly enough, it's explained in detail in the TI-99/4A console and peripheral expansion system technical data, 1049717-1.

 

Here's a copy of section B.1.2.1

 

READ before WRITE considerations

There are different READ and WRITE addresses for most of the memory-mapped devices (MMDs). This is because the TMS 9900 does a READ operation at the destination address prior to writing to it. Many of the MMDs have internal address registers that autoincrement after either a READ or a WRITE operation. This autoincrement characteristic of the MMD may not produce the desired results if it is  not taken into consideration when designs or modifications are made.

The READ before WRITE exists because the TMS 9900 is a word-oriented machine from the memory access standpoint. The several byte-oriented instructions are carried out by the machine in a word execution format, and the other byte in the word must not be altered. The machine itself must save the unaltered byte, concatenate the new byte to it, and return the word to memory. The internal logic of the TMS 9900 is designed this way because it was to the designers distinct advantage to do this same READ before WRITE on both byte and word moves.
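The VDP data port is the everyday example of that first paragraph. A small illustration (standard console equates; the comments are my reading of the decoding, not quoted from the document):

VDPRD  EQU  >8800            VDP read-data port (a read autoincrements the VDP address)
VDPWD  EQU  >8C00            VDP write-data port (a write autoincrements it too)
* The hidden read-before-write of  MOVB R1,@VDPWD  lands on >8C00, which the
* console decodes for writes only, so the VDP address register is not bumped
* an extra time.  With a single shared data address, every write would be
* preceded by a read that advanced the address and threw the transfer off.
       MOVB R1,@VDPWD        one byte out, the address advances exactly once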

 

Another thing I noticed in that document may shed some light on the discussion we've had about the usefulness of LIMI 2 when LIMI 1 does the same job. This is about how to design a card to be used in the expansion box.

 

There are two levels on which to interrupt, but the TI-99/4A supports only one (INTA*). THIS IS THE ONE YOU MUST USE. Interrupt level status bits are defined by the Personal Computer PCC at Texas Instruments, and for the moment are not sensed by the TI-99/4A. If they were to be sensed, the TI-99/4A would cause a line to go low (SENILA*), which tells the PCB logic to gate its status bit to the system data bus.

 

I don't know, but there could be a reason for LIMI 2 being used buried here, since they obviously at some time had two interrupt levels in consideration. Reading this, it sounds very much as if somebody planned for more computers than just the TI-99/4A to be able to use the expansion box.


10 hours ago, matthew180 said:

 

Sure enough, I forgot about that.  The read-before-write info is not in the datasheet though (that I could find), and very easy to forget about.  I only ever saw it mentioned as a note in the System Design book TI published.

 

 

So, I was mistaken.  There must be something else going on with your tests.

Indeed.  My MOVE program tests for overlapping memory and chooses one of two different loops. I might have a problem with that overlap detector.

 


14 hours ago, apersson850 said:

The internal logic of the TMS 9900 is designed this way because it was to the designers distinct advantage to do this same READ before WRITE on both byte and word moves.

 "to the designers distinct advantage" made sense to me--it was  a trade-off. But what did it really mean?
 

According to a Texas Instruments patent disclosure, the 9900 instruction set is decoded by the number of leading zeroes.

This number selects the "entry point" to the CPU's control ROM (microcode). For each "entry point", instructions share a common operand decoding.

 

None or 1 is Format I

2 zeroes is Format III,IV,XOP

3 zeroes is Format II

etc

 

The "BYTE" bit isn't up front in the instruction. So whether it's 0 or 1, it can't affect the entry point. MOV has the same number of zeroes as A (add) and ADD definitely needs the destination to be read.  

 

As a result, all Format I instructions fetch the destination operand, even MOV which doesn't need it. 


9995 

This CPU doesn't need to fetch or store a whole word if the instruction is a BYTE type. Special case: does MOVB avoid read-before-write? (I think it does.)

 

99000

The 99000 CPUs eliminated read-before-write as a special case for MOV. Register-to-register MOV takes just 3 cycles, where MOVB still takes 4. (MOV: WR read, AUMS, WR write.)


 

TL;DR


Much more than just read-before-write. 

We're going to decode the 9900 instructions at the bit level. You'll see the leading zeros concept throughout. 
 

Format I instructions, A or MOV for example, begin with one or no zero. Format I has general destination and source operands.

 

Two leading zeroes means  Format III: one destination register, one general source operand. 
 

Format I and III examples

MOV
1100 Td dddd Ts ssss
MPY (format III)
0011 10 dddd Ts ssss

 

where Ts,Td are:

00 register, RX

01 register indirect , *Rx

10 symbolic @Y(R0)

or indexed @Y(Rx)

11 register indirect auto increment, *Rx+

dddd and ssss are register numbers. 

 

Some longer tables sorted  by leftmost  bits, counting down. 


Format I. One or no leading zero. Two general operands. 
Opcode is 3 bit ALU operation code, one bit "BYTE" flag. Then 6 bits DST, 6 bits SRC. (Total 16 bits)


Opcodes:

1111 SOCB
1110 SOC (set ones) 
1101 MOVB
1100 MOV (move)
1011 AB (add byte)
1010 A (add)
1001 CB
1000 C (compare)
0111 SB
0110 S (subtract)
0101 SZCB 
0100 SZC (set zeroes)

 


Formats III, IV, XOP have 2 leading zeroes.

 

0011 xx MPY, DIV, LDCR, STCR 
0010 xx XOP, XOR, CZC, COC 

 

These are 8 similar opcodes made using 6 bits. 2 zeroes, a 1, then 3 more bits. (2^3 = 8.) Leaving 10 bits for operands. Opcodes have started to nibble at operand fields, eating Td first.

 

Decoding:

 

001x xxdd ddTs ssss

 

The dddd is a 4 bit number, interpreted  as register,  shift count, or XOP number. 
 

Finally 6 bits for  one general source operand.

 

(Note source operand always occupies rightmost bits!)


(2 + 1 + 3 + 4 + 6 bits = 16)

 

Format II has 3 leading zeroes. 16 opcodes identified by first 8 bits. Rest is an 8 bit offset:


 

0001 xxxx   Jumps and CRU single bit 

0001 1111 TB
0001 1110 SBO
0001 1101 SBZ
0001 1100 JOP
0001 1011 JH
...
0001 0000 JMP

 

(Notice that from xxxx, the CRU bit instructions are easy to separate. Jumps that test status bits have some 1s. Plain JMP has zeroes.)


For jumps, you could easily write out  sum-of-products equations of the xxxx bits and first 5 register bits.

 

The leading zeroes concept continues through the end of the instruction set.
 

4 zeroes is Format V: Shifts SRA ...

5 zeroes is Format VI: CLR, BL, ...

6 zeroes is Format VII: LWPI,RTWP and VIII LI, CI,...

 

9995 and others have:

 

7 zeroes is MPYS,DIVS

8 zeroes is LWP, LST 

 

Observation: the E/A Roman numeral is approximately the bit position of the first 1.

 

Swap Bus 

 

The Byte variant of Format I causes the "Swap Bus" to meddle with the byte order of operand values. On their way between memory, ALU, then back to memory, they either sail through or get swapped as needed. It's like Scylla and Charybdis. 



The "Swap bus" is just more silicon control lines  to multiplex the bytes into the left byte!  The "Swap Bus" is 16 2-input multiplexers between the ALU and memory data register. (Abbreviated MD or MDR.)

 

Byte Swap isn't an extra clock cycle--it's just gates switched on or off as the value moves across the swap bus.  
 

On the other hand, microcode for SWPB needs to tell it to swap on the front half, not the back half cycle.  (Otherwise SWPB is like MOV!)  

 

Memory -> MDR -> Swap -> ALU B -> (no Swap) -> MDR -> Memory.

 

TL;DR again. Even more, concerning microcode:

 

 

As said above, the number of leading zeroes  selects the "entry point" to the CPU's control ROM (microcode) for an instruction group having  the same operand decoding steps.  Again, the "BYTE" bit isn't up front in the instruction, so it can't affect the entry point.

 

Control ROM is an array of silicon lines which pass across all other functional blocks of the CPU. A group of lines is like a punched card that turns on or off all the functional units of the CPU. Some lines are for the first 1/4 or 1/2 of the cycle, some for later.

 

(One clock cycle is really 4 internal clocks, and 4 steps is enough to:

1. Compute address by adding two things like WP + Rx

2. Set the memory bus address

3. Strobe Read or Write signal

4. Capture the read word internally)

 

Format I has activated one chunk of microcode, which does the work for  two general operands.  

 

Microcode for a one-operand format, like Format III, would decode only the general source operand field.
 

Observe why the general dest field comes before the source field.    Instructions with more bits in the opcode will eat up Td, then dddd.  But general source can stay  in a common field. 
 

Decoding the general source could be a "subroutine", saving a lot of silicon space! (There might be other tricks.)


The location of source bits remains the same down to the instructions that leave just 4 bits for an operand! Of course in single general (CLR) or immediate register (AI) formats, the "Source" is also a destination.


LI seems to be a special  case!

ALU


Microcode ultimately sets up  the ALU with input operands A and B, and an operation code.  

 

SOC  111 B = B OR A
MOV  110 B = B
A    101 B = B + A
C    100 B = -B + A (set status) 
S    011 B = -B + A (huh?)
SZC  010 B = B AND NOT A

NEG  ????  B = -B
INV  ????  B = NOT B
ABS  ????  this one has microcode
SETO ????  B = FFFF

SLA  ???? While(SC--), multiplex B bits from left or right neighbor bit

etc


A full set of 9-bit operation codes is known for the 990/12 and the ALU chip 74S181. See Bipolar Microcomputer Components Databook. 
 

A set of 16 operation codes suffices for the 74LS181 ALU, a really common chip. See any TTL data book.


 

Historical Note: extra adders

 

In drafts of the schematic for the 99/5, a memory mapper used a set of 181s to add logical address and a base address.
 

181s used just for the add operation!  A 16-bit adder allowed any 32-byte boundary.
 

(seen in first draft of 99/5 from Ron Wilcox to Don Bynum.)

 

In  the 990 memory mapper, a 16-bit base address was also shifted left 5 bits then added to the logical address. This selected any 32-byte boundary for 2 megabytes max.
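A quick worked check (my arithmetic, not from the draft): physical address = (base << 5) + logical offset, so the 5-bit shift is exactly the 32-byte granularity, and a 16-bit base shifted left 5 reaches >FFFF x >20 = >1FFFE0, i.e. any 32-byte boundary within the 2-megabyte space.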

 

(The 990 supported 3 "segments" where one of 3 base address registers was added to the logical addresses. So a program segment could be mapped from any RAM address to any logical start address.  You might have two shared code segments and a data segment.)

 



The 32-byte unit in a proposed home computer would match the 990 minicomputer exactly, except that in the home computer the 64K would be divided into 32 little 2K windows instead of beginning at 2K page boundaries.

 

Four extra 181s would be pricey for a home computer.  The 181s were eliminated after a first draft of the 99/5, to be absorbed into the 99/8 mapper chip.
 


1 hour ago, FarmerPotato said:

 "to the designers distinct advantage" made sense to me--it was  a trade-off, but what did it really mean?

Most probably that they could reuse quite a lot of the handling of the instructions if they behaved in the same way, byte or word.

As you also go into in your post, just in more detail.


  • 2 weeks later...

Finally had some time to go back and play with a small bit of assembly.

 

I've been proofreading an English translation of the German assembly books kursi.

 

One of the comments he made in an example was that we needed to get the MSB into the LSB position for the next instruction.  He mentioned that instead of using SWPB, we would use SRL; while that does move the MSB to the LSB, you lose the value of the LSB (unless it's already zero! 😃 ).

 

Using Steve's Cheat Sheet, I was able to quickly grab the instruction's FMT; using xdt99's listing, I quickly got the number of clock cycles.  Using a shift instruction costs double the clock cycles of SWPB, tested by single-stepping with Classic99.

 

That got me thinking about another example I've seen, which is setting a register to zero by using XOR. That also costs four additional clock cycles over LI, or two additional clock cycles compared to CLR, with CLR being the easy winner for expressing intent.

 


     * test - does swpb and srl work the same, which is best?
0011 0026 0208  16        li   r8,>FF01                   fmt 8
     0028 FF01
0012 002A 06C8  18        swpb r8                         fmt 6
0013 002C 0208  16        li   r8,>FF01
     002E FF01
0014 0030 0988  36        srl  r8,8                       fmt 5
0015               *                                      not same, lsb becomes 0
0016 0032 0208  16        li   r8,>FF01
     0034 FF01
0017 0036 0B88  36        src  r8,8                       fmt 5
0018               *                                      does the same, but costs an extra 18 clock cycles
0021               * test Set Reg to Zero - most efficient.
0022 003A 0207  16        li   r7,>0007                   fmt 8
     003C 0007
0023 003E 04C7  18        clr  r7                         fmt 6
0024 0040 0207  16        li   r7,>0007
     0042 0007
0025 0044 29C7  20        xor  r7,r7                      fmt 3

 


1 hour ago, dhe said:

Finally had some time to go back and play with a small bit of assembly.

 

I've been proof reading an english translation of the german assembly books kursi.

 

One of the comments he made in an example was that we needed to get the MSB into the LSB position for the next instruction.  He mentioned that instead of using SWPB, we would use SRL; while that does move the MSB to the LSB, you lose the value of the LSB (unless it's already zero! 😃 ).

 

Using Steve's Cheat Sheet, I was able to quickly grab the instruction's FMT; using xdt99's listing, I quickly got the number of clock cycles.  Using a shift instruction costs double the clock cycles of SWPB, tested by single-stepping with Classic99.

 

That got me thinking about another example I've seen, which is setting a register to zero by using XOR. That also costs four additional clock cycles over LI, or two additional clock cycles compared to CLR, with CLR being the easy winner for expressing intent.

 

 

Of course, there can be many reasons for choosing one operation over another.

 

Regarding getting the MSB to the LSB, the point of using SRL Rn,8 over SWPB Rn is usually to zero the MSB in the same instruction—saving instructions can outweigh speed. If you want to duplicate SWPB with a shift, SRC Rn,8 will do nicely. Furthermore, if you want to test the result, the shift is the only way—SWPB does not affect the status register.
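For example (the value in R1 and the label are only for illustration):

       SWPB R1               bytes of R1 swapped, status register untouched
* versus
       SRC  R1,8             same swap, but the status bits now reflect the result
       JEQ  ZERO             e.g. branch if R1 came out zero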

 

Regarding zeroing a register, CLR is, indeed, the clear (groan) winner—unless, for some reason, you need a comparison. Both XOR and LI affect the status register, whereas, CLR does not. LI also takes an additional 2 bytes over the other two, when dealing only with registers.

 

...lee


There's not much to learn by testing status bits after doing a CLR, regardless of whether it is possible or not. You already know that the result is pretty close to zero.

 

A method you haven't mentioned yet is taking advantage of the peculiar design of the TMS 9900. If you want to access R7 byte by byte you can simply do

MOVB R7,somewhere

MOVB @MYREGS+15,somewhere+1

 

In case you don't know which registers you are running with, you can do this

MOVB R7,somewhere

STWP R4

MOVB @15(R4),somewhere+1

 

The result at somewhere is the same as after

MOVB R7,somewhere

SWPB R7

MOVB R7,somewhere+1

 

The first two have the possible advantage that the content of R7 is not disturbed.

