Assembly guidance

matthew180 · March 15, 2010

The 16k RAM in the console isn't CPU-addressable, it's actually video memory. This is why TI BASIC and Extended BASIC are so sluggish; everything is passing through an 8-bit port to and from video memory.

Not just that, but the fact that BASIC and XB are written in GPL (an internal interpreted language built into the console), means that TI's BASIC and XB are double interpreted. When you see the difference in speed between BASIC and assembly, it makes you go WTF? I believe the TI has the slowest BASIC of all the classic home computers of the time. That's just another one of those things the designers did to really screw the design.

Matthew

Opry99er · March 15, 2010

Begs the argument that we should rebuild XB into a pure assembly program and slap that mother on a cart. Imagine the speeeeeeed

JamesD · March 16, 2010

Looks like TI did that because of the memory based registers.

So even if you load a byte, it's not in the correct position to use it for a loop counter without shifting? That's just wrong!

You are correct, loading a byte from a variable will load into the MSB of a register, so to use it as a loop counter you have to either clear the register, load the byte, and SWPB - or clear the register and load to the LSB with the trick I showed in a previous post. The other option is to use 2-byte variables, which is a waste sometimes, but perfectly acceptable if you are not going to be pushing memory limits. It is a pain in the ass though for sure.
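For anyone who wants to poke at the byte-in-the-MSB behavior without firing up an emulator, here is a rough model in Python. The function names are made up for illustration; they only imitate what MOVB and SWPB do to a 16-bit register, so treat it as a sketch rather than a cycle-accurate anything.

```python
def movb_to_register(reg, byte):
    """Imitate MOVB into a register: only the high (MSB) half is replaced."""
    return ((byte & 0xFF) << 8) | (reg & 0x00FF)

def swpb(reg):
    """Imitate SWPB: swap the two bytes of a 16-bit register."""
    return ((reg << 8) | (reg >> 8)) & 0xFFFF

count = 5                          # a 1-byte variable in memory
r1 = 0x0000                        # CLR  R1
r1 = movb_to_register(r1, count)   # MOVB @COUNT,R1  -> byte lands in the MSB
as_loaded = r1                     # 0x0500, useless as a loop counter of 5
r1 = swpb(r1)                      # SWPB R1         -> 0x0005, now a usable counter
```

Skipping the clear step leaves whatever was in the low byte before the load, which is why the clear, load, swap sequence is spelled out above.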

Sugar coat it why don't you?

I'm thinking 2 byte variables is the way to go. You'd waste more memory on longer code than you'd save on the variables. Except in the case of dealing with a lot of data through 1 routine.

So, are TI BASIC tokens a BYTE or a WORD?

The Wiki says the 9900 only has a 15 bit address bus... so a max of 32K without bank switching I'm assuming?

So that explains why they never shipped the machine with over 16K RAM and when other computers jumped to 64K built in RAM they didn't follow. So the 32K RAM expansions must bank out the ROMs and you can't use it from BASIC?

Well, the 9900 has a 16-bit data bus and always accesses memory on an even address, so that would be 32K "words", which is 64K "bytes". The 16K that came with the computer was actually the dedicated VDP (9918A) memory! The only "real" 16-bit RAM in the machine is what is called the "scratch pad" memory at >8300 to >83FF. The 32K RAM expansion in the PEB is actually multiplexed to an 8-bit bus! Everything in the machine is pretty much on the multiplexed 8-bit bus except the scratch pad RAM and the console ROM, which makes the speed of the 16-bit CPU pretty much useless and not much of an advantage. The multiplexer adds 4 wait states for every memory access, and that on top of the read-before-write nature of the 9900 makes the system as a whole very slow.
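The words-versus-bytes arithmetic is easy to check. A tiny Python sketch, assuming nothing beyond the 15 address lines and 16-bit data bus mentioned above (the constant and function names are just for illustration):

```python
ADDRESS_LINES = 15    # address lines on the 9900; the byte-select bit is not brought out
BYTES_PER_WORD = 2    # 16-bit data bus

words = 2 ** ADDRESS_LINES
total_bytes = words * BYTES_PER_WORD

def word_address(byte_address):
    """The word the CPU actually fetches: the byte address with its low bit dropped."""
    return byte_address & 0xFFFE

print(words, total_bytes)           # 32768 65536
print(hex(word_address(0x8301)))    # 0x8300
```

So an odd byte address and the even one just below it land on the same 16-bit word.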

Well, 32K words isn't any better than 32K bytes in this case because instructions are always made up of 16 bit words.

Look up evil in the dictionary and you'll find a photo of the engineers that worked on the TI-99/4.

So does Geneve eliminate the multiplexing?

retroclouds · March 16, 2010

Very interesting topic people! And yes the design of the TI-99/4A might be, well let's just call it exotic.

However the power of the TMS9900 CPU is really impressive if you ask me. Especially if you manage to get as much as possible out of the 16 bit scratch pad memory. And don't forget we have hardware multiply (MPY) and divide (DIV) instructions. Now how many home computers had that back in the day?

It is really a shame there aren't that many high-quality arcade games available for the TI-99/4A. But we all know the reason for that is not the machine itself, it was TI's marketing strategy.

Anyway, I for one am hoping to see new homebrew games written in assembly language.

matthew180 · March 16, 2010

Well, 32K words isn't any better than 32K bytes in this case because instructions are always made up of 16 bit words.

Look up evil in the dictionary and you'll find a photo of the engineers that worked on the TI-99/4.

Yes, the opcodes are 16-bit, but that does not mean it takes any more RAM to perform a given task. On the 9900 you can add, sub, mul, div 16-bit values in 1 instruction. How many 8-bit opcodes does that take on a Z80 or 6502? I think you have to look at the solution to a specific problem as a whole and measure the final code; doing a comparison at the instruction level is not a practical instrument of measure. Personally I find the 16-bit values very handy and rarely do I find a need for larger values (one example would be a game score greater than 65535.) The 16-bit opcodes also pack more source and destination information into the instruction, so where an 8-bit opcode would require an additional byte or two following the instruction, the 9900 has that information in the single opcode. On the 99/4A I think the biggest problems are the implementation of the hardware, not the 9900 itself.

So, are TI BASIC tokens a BYTE or a WORD?

The built-in GPL language is a dense byte-code language, certainly to save space and to work with TI's GROM and GRAM chips. Since TI BASIC and XB are written in GPL, I'm pretty sure they too are byte oriented. I believe all the BASIC and XB tokens fit into the 0 - 255 range, but I don't know for sure, I try to know as little about XB as possible. ;-)

So does Geneve eliminate the multiplexing?

I don't know anything about the Geneve except: 1. they are hard to get and those that have them treat them like gold found from another planet. 2. they use the TMS9995, so I assume a lot of the problems of the 99/4A were addressed. But as far as I am concerned, the Geneve is not a 99/4A, not generally available, and is basically a complete computer on a card; personally I have little interest in them. Same with hard disks and 3rd party banking RAM cards, unless they are available to 100% of the community and people have them more times than not, they are simply things to tinker with.

Matthew

JamesD · March 16, 2010

Very interesting topic people! And yes the design of the TI-99/4A might be, well let's just call it exotic

Well, the design is what I call "clever". When hardware/software engineers are "clever" it's either really good or really bad. I've seen both. In this case I think it came back to bite TI in the butt and knocked them out of the market so I'd say bad. But the pieces were good enough to make it somewhat competitive.

However the power of the TMS9900 CPU is really impressive if you ask me. Especially if you manage to get as much as possible out of the 16 bit scratch pad memory. And don't forget we have hardware multiply (MPY) and divide (DIV) instructions. Now how many home computers had that back in the day?

Yes, the opcodes are 16-bit, but that does not mean it takes any more RAM to perform a given task. On the 9900 you can add, sub, mul, div 16-bit values in 1 instruction. How many 8-bit opcodes does that take on a Z80 or 6502?

It really depends on what you are doing if it is a limitation or not.

The Z80 can deal with 16 bit pointers and some limited math as well. The special purpose nature of many registers and inefficiency of index registers vs HL ends up making the code much larger than it should have been. What good are index registers that you rarely use because HL is faster? Its stack usage from a C compiler isn't very efficient either. The 64180/Z180 made stack use from C better, reduced clocks/instruction by 20% and added a multiply. The chip showed up on a few accelerator boards and I got to program some embedded systems based on it.

The 6502 has to deal with 16 bit pointers on page zero and 16 bit variables outside the CPU. At first it seems inefficient but when you actually implement something I've found you can use an 8 bit index off of that pointer and only have to update the pointer every now and then so most math is 8 bit. When you do update the pointer, you have to add, check the carry, and add again. Similar results with variables. The code size is much larger but not all of it has to execute every time to deal with 16 bits. Multiplication and division is probably the worst but whenever possible people use tables so it depends on what you are doing. Lack of 16 bit registers makes things much more complicated than they should be. C compilers on the 6502 almost guarantee the parameter stack is going to be handled by a page zero pointer due to the 256 byte hardware stack and the code will be rather large.
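To make the "add, check the carry, add again" step concrete, here is a small Python sketch of a 16-bit add built out of 8-bit pieces, next to the single-step version a 16-bit CPU gets for free. It models the arithmetic only, not any particular instruction sequence, and the function names are invented for the example.

```python
def add16_with_8bit_steps(value, addend):
    """Add two 16-bit numbers the 8-bit way: low bytes first, then high bytes plus carry."""
    lo = (value & 0xFF) + (addend & 0xFF)
    carry = lo >> 8                              # 1 if the low-byte add overflowed
    hi = (value >> 8) + (addend >> 8) + carry
    return ((hi & 0xFF) << 8) | (lo & 0xFF)

def add16_single_step(value, addend):
    """The same add done in one 16-bit operation."""
    return (value + addend) & 0xFFFF

print(hex(add16_with_8bit_steps(0x12FF, 0x0001)))   # 0x1300
```

The carry only matters when the low byte wraps, which is why the pointer update can usually be skipped while an 8-bit index does the work.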

The 6803 (Tandy MC-10) and 6809 (Tandy Color Computer, Dragon, Thomson MO,TO5/7/8/9) can add and subtract 16 bit numbers, multiply 8 bit numbers and if you drop a Hitachi 6309 in place of the 6809 you can multiply and divide 16 bit numbers and deal with up to 32 bit numbers. Sadly the Hitachi 6303 (6803 compatible) didn't put in a divide... at least not that anyone is aware of, but the 6309 additions were never announced either. The 6809 does well with any compiled language due to these features. The 6803 doesn't have stack relative addressing so it needs the stack pointer copied to the index register to access parameters passed on the stack, and with only one index register that means a lot of swapping will probably take place. It's similar to the Z80's limitations.

FWIW, if you have that many wait states on a 9900 then I wouldn't go counting clock cycles vs any other CPU.

If the 99/4A hadn't crippled the 9900 it might do well in number of clock cycles. I also think it should support a C compiler fairly well though I haven't seen enough to be sure.

I ported a simple music player for the AY sound chip from the Z80 to 6502, 6803 and 6809. The player/data format is pretty simple. You have a number for how many interrupts between changing registers, a number for how many registers to modify and then it just dumps the register number to modify and the register setting to the sound chip repeatedly till you have updated all the registers the data tells you to. And of course it checks to see if the end of the song has been reached.
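As a rough illustration of that data format, here is a Python sketch that walks a song and yields each update. The byte order within a block and the end-of-song marker (0xFF here) are assumptions for the example, since the post doesn't spell them out, and the register numbers in the sample data are arbitrary.

```python
END_OF_SONG = 0xFF   # assumption: the post does not say how the end is marked

def frames(song):
    """Yield (ticks_to_wait, [(register, value), ...]) for each block of song data."""
    pos = 0
    while pos < len(song) and song[pos] != END_OF_SONG:
        ticks = song[pos]          # interrupts to wait before this update
        count = song[pos + 1]      # how many registers to modify
        pos += 2
        writes = []
        for _ in range(count):
            writes.append((song[pos], song[pos + 1]))   # register number, new setting
            pos += 2
        yield ticks, writes

song = bytes([1, 2, 0, 0x10, 8, 0x0F,    # wait 1 tick, write 2 registers
              4, 1, 8, 0x00,             # wait 4 ticks, write 1 register
              END_OF_SONG])

for ticks, writes in frames(song):
    print(ticks, writes)
```

In a real player the loop body would run from the interrupt handler, counting down the ticks and then dumping the register pairs to the sound chip.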

The Z80 code is 148 bytes.

The 6502 code is 198 bytes, but it has 20 extra bytes for a new feature and the Oric interfaces the AY through another chip so additional setup is required. I'm guessing it's around 160 bytes without the extra overhead.

The 6803 code is 108 bytes. Yes sports fans, almost 30% smaller code than the Z80 and almost half the 6502.

I haven't checked the 6809 code size but I know it's going to be even smaller. I might get the 6803 code under 100 bytes with a trick I have learned since, using the stack pointer as a 2nd pointer. I can only do that with interrupts disabled but it's an interrupt handler with the interrupts already getting disabled. If not smaller, the code will still be faster in the inner loop.

Ultimately, the Motorola chips are almost always going to require less code and fewer clock cycles than the 6502 or Z80 and not by a small amount.

The 9900? I'd have to port it and test the code on an emulator since no system with a 9900 has an AY sound chip. However, with minor changes it could probably work with the TI sound chip. The song data would be totally different though.

The 65816 borrows ideas from the 6809 and can deal with 16 bit numbers. It also increases the stack pointer to 16 bits and lets you access a lot of memory. However, it relies on mode switching to go between 8 and 16 bits so it's much less efficient than the 6809/6803 in that dept. However, it's much better than the 6502 and adequately supports a C compiler.

Code size might be a little bloated but you can actually use the hardware stack for parameters.

I think you have to look at the solution to a specific problem as a whole and measure the final code; doing a comparison at the instruction level is not a practical instrument of measure. Personally I find the 16-bit values very handy and rarely do I find a need for larger values (one example would be a game score greater than 65535.) The 16-bit opcodes also pack more source and destination information into the instruction, so where an 8-bit opcode would require an additional byte or two following the instruction, the 9900 has that information in the single opcode. On the 99/4A I think the biggest problems are the implementation of the hardware, not the 9900 itself.

While your argument has merit, the 9900's 8 bit limitations mean more 16 bit instructions to do what an 8 bit instruction might do in a byte. It goes two ways.

One thing that is important that hasn't been mentioned is how long it takes to write a program for a given CPU.

The special purpose nature of the Z80 registers means you have to write more code and it takes more time.

I didn't write the Z80 version of the music player but making it efficient is definitely going to take longer than the Motorola or TI chips.

The 6502 requires more time and code because of lack of 16 bit anything.

Both 6502 and Z80 programmers often use macros they have built up over time to save on coding time but it may result in some lost clock cycles if they get carried away. I still think Z80 and 6502 code will take longer to write than Motorola or TI.

The limited number of registers on the Motorola chips might require more register <> RAM swapping than the 9900 but things are pretty straightforward and you don't have to do any theatrics. Also, the 6809's LEA instruction, many address modes and movable direct page make it very easy to write fast code. It's also easy to write relocatable code on the 6809.

The 9900 CPU should be fairly quick to write code for... until you cripple it with the 99/4A's design and what you gain in not jumping through hoops instruction wise, you lose jumping through hoops system architecture wise.

It's a good thing the TI had sprites and sound chips. If it hadn't it wouldn't have lasted as long as it did.

adamantyr · March 16, 2010

When I get my CRPG for the TI-99/4a finished (hopefully this year, don't hold me to it), I want to take a crack at conversion for other platforms myself.

That would be an interesting way to see how much difference the opcode structure makes. Would the Z80/6502 version end up being the same size, larger, or smaller? Faster in some places, slower in others?

I'd like to do the ZX Spectrum and MSX platforms for certain, and a C64/Apple/Atari 8-bit version. One good thing about Atari Age, I shouldn't have a problem finding interested volunteers.

Adamantyr

matthew180 · March 16, 2010

I ported a simple music player for the AY sound chip from the Z80 to 6502, 6803 and 6809. The player/data format is pretty simple. You have a number for how many interrupts between changing registers, a number for how many registers to modify and then it just dumps the register number to modify and the register setting to the sound chip repeatedly till you have updated all the registers the data tells you to. And of course it checks to see if the end of the song has been reached.

The Z80 code is 148 bytes.

The 6502 code is 198 bytes, but it has 20 extra bytes for a new feature and the Oric interfaces the AY through another chip so additional setup is required. I'm guessing it's around 160 bytes without the extra overhead.

The 6803 code is 108 bytes. Yes sports fans, almost 30% smaller code than the Z80 and almost half the 6502.

The 9900? I'd have to port it and test the code on an emulator since no system with a 9900 has an AY sound chip. However, with minor changes it could probably work with the TI sound chip. The song data would be totally different though.

I actually just got done writing a sound player for Owen to support his XB game. It is written in assembly and the core of it runs during the VDP vsync every 60th of a second, so the process sounds very similar to what you described above. Also, the SN76489 (TI's generic version of the sound chip used in the 99/4A) and the AY chip are very similar in functionality and programming (I've been looking into them lately, after doing the sound player for Owen.)

The main "player" is 144 bytes and I was simply going for functionality and compatibility when being called from XB. I could probably strip that back to a very compact player if I didn't have to worry about running in an XB environment (which I will be doing for the game I'm working on.) I also added nested looping and linking features in the sound data, so if I pulled out that ability the player would get even smaller. Are you counting support and setup functions in your code size? My player has to deal with passing variables to / from XB, so that adds overhead, but those functions are not part of the code that actually plays the sound data. The whole codebase for all the XB support functions, error checking, and the player is about 444 bytes.

Anyway, I posted a big huge detailed thread about it a few weeks ago here on A.A., should not be too hard to find since we only have two pages of posts. ;-) Check it out if you are interested.

Matthew

JamesD · March 16, 2010

I actually just got done writing a sound player for Owen to support his XB game. It is written in assembly and the core of it runs during the VDP vsync every 60th of a second, so the process sounds very similar to what you described above. Also, the SN76489 (TI's generic version of the sound chip used in the 99/4A) and the AY chip are very similar in functionality and programming (I've been looking into them lately, after doing the sound player for Owen.)

The main "player" is 144 bytes and I was simply going for functionality and compatibility when being called from XB. I could probably strip that back to a very compact player if I didn't have to worry about running in an XB environment (which I will be doing for the game I'm working on.) I also added nested looping and linking features in the sound data, so if I pulled out that ability the player would get even smaller. Are you counting support and setup functions in your code size? My player has to deal with passing variables to / from XB, so that adds overhead, but those functions are not part of the code that actually plays the sound data. The whole codebase for all the XB support functions, error checking, and the player is about 444 bytes.

Anyway, I posted a big huge detailed thread about it a few weeks ago here on A.A., should not be too hard to find since we only have two pages of posts. Check it out if you are interested.

Matthew

The code sizes are everything but the sound data, that includes setup, play, and safe exit. I haven't converted them all to do the playing within the interrupt. The original Aquarius version busy looped looking for the vertical blank. I think I updated the current Z80 code to be fully interrupt driven but I know a few things can be optimized. Still, the code size would change to work exactly like yours. I'm sure someone with more Z80 experience could shave a little off it size or speed wise. I spent the most time on the 6502 code so I don't think they could cut much there. I don't think there will be a huge size change to any of them though. Setting the target as 65C02 (which the player supports) will drop a few bytes due to added instructions as would directly accessing the AY chip.

I have been looking at porting it to the 65816 and the 65816 could cut 30+ bytes off the 6502 code without even trying. I'm thinking it could get close to the 6803 but I'm not sure if it could beat it. The mode switching is ugly but it also supports some direct page addressing which may make up for that.

There are no nested loops or links in the current player but the added code for the 6502 lets me embed a bunch of commands which are called from a table. The added code is to make the table call which requires more code on the 6502.

The only command in the table at the moment is to exit. I could add commands to jump to different locations in the sound data and to use conditional loops. I just threw the jump table into the 6502 code a couple weeks ago as a proof of concept. BTW, the table jump code takes over 20 bytes... if I can use 8 bit offsets the jump code drops in size.

I'm guessing you could get the 9900 code size down close to the same size as the 6803 +- a few bytes, but that isn't fully optimized given what I know now. If I pull out all the stops on the 6809 code I don't think the 9900 has a chance. Direct page (DP) addressing can cut data accesses and code branches down to byte offsets and song data can be loaded off the user stack including loading multiple registers with one instruction. That's pretty tough to top size wise not to mention speed wise.

<edit>

No song loops or links is what I meant.

Edited March 17, 2010 by JamesD

JamesD · March 17, 2010

Edited March 19, 2010 by JamesD

matthew180 · March 23, 2010

Yes, tried it with classic99 and works ok, but the calculated length is too big.

Reason is that the TMS9900 is a 16 bit CPU and opcodes must be at even addresses.

That is why winasm automatically adjusted the address of label ND.

This causes the length to be 0C (12) instead of 0B (11).

   
  1  0000 0000        REF   VMBW           ; External function VMBW
  2                   AORG  >A000  <---- Assemble starting at absolute address >A000
  3  A000 0200 SFIRST LI    R0,1           ; VDP target address
  3  A002 0001  
  4  A004 0201        LI    R1,ST          ; Source address in RAM
  4  A006 A012  
  5  A008 0202        LI    R2,ND-ST       ; Number of bytes to write
  5  A00A 000C <----------- should be 0B  
  6  A00C 0420        BLWP  @VMBW          ; VMBW = Multiple Byte Write
  6  A00E 0000  
  7  A010 10FF        JMP   $              ; soft-halt ($ means program counter)
  8  A012 4845 ST     TEXT  'HELLO WORLD'
  8  A014 4C4C  
  8  A016 4F20  
  8  A018 574F  
  8  A01A 524C  
  8  A01C 44    
  9  A01D 0000        EVEN     *>>> Assembler Auto-Generated <<<
 10  A01E 0000 ND     END   SFIRST
 10

Yes, and that is one of the reasons I'm writing a new assembler, because Asm994a is wrong (again). I was doing some testing using the E/A assembler and noticed something strange, i.e. that my start address was at an odd address!

.
99/4 ASSEMBLER
VERSION 1.2                                                  PAGE 0001
 0001                   DEF  MAIN  
 0002                
 0003 0000   01  VAR1   BYTE 1   
 0004                
 0005            MAIN  
 0006 0002 0200         LI   R0,ND-ST  
      0004 000B  
 0007 0006 10FF         JMP  $   
 0008                
 0009 0008   48  ST     TEXT 'HELLO WORLD'   
 0010            ND     END  MAIN  
.
99/4 ASSEMBLER
VERSION 1.2                                                  PAGE 0002
 D MAIN    0001    ' ND      0013      R0      0000      R1      0001    
   R10     000A      R11     000B      R12     000C      R13     000D    
   R14     000E      R15     000F      R2      0002      R3      0003    
   R4      0004      R5      0005      R6      0006      R7      0007    
   R8      0008      R9      0009    ' ST      0008    ' VAR1    0000    
 0000 ERRORS

I'm in the habit lately of placing my labels on their own line, which is what made me notice this behavior. It also reminded me of this thread, so I wanted to post these corrections. Two things to note, 1. that MAIN starts at 1, not 0 or 2, and 2. that the TI assembler does not force the END directive to an even address... why should it? It is not a machine instruction. Now, if I move the label to the first instruction, then it comes out even:

.
99/4 ASSEMBLER
VERSION 1.2                                                  PAGE 0001
 0001                   DEF  MAIN  
 0002                
 0003 0000   01  VAR1   BYTE 1   
 0004                
 0005 0002 0200  MAIN   LI   R0,ND-ST  
      0004 000B  
 0006 0006 10FF         JMP  $   
 0007                
 0008 0008   48  ST     TEXT 'HELLO WORLD'   
 0009            ND  
 0010 0014 C000         MOV  R0,R0   
 0011                
 0012                   END  MAIN  
.
99/4 ASSEMBLER
VERSION 1.2                                                  PAGE 0002
 D MAIN    0002    ' ND      0013      R0      0000      R1      0001    
   R10     000A      R11     000B      R12     000C      R13     000D    
   R14     000E      R15     000F      R2      0002      R3      0003    
   R4      0004      R5      0005      R6      0006      R7      0007    
   R8      0008      R9      0009    ' ST      0008    ' VAR1    0000    
 0000 ERRORS

Also note that even if the ND label is followed by an instruction which must be aligned, it still gets the correct address. So, the original posted code should have worked, at least when calculating the length. The last thing I want to point out is that this functionality is by design. The E/A manual states on page 47:

"A source statement consisting of only a label field is a valid statement. It has the effect of assigning the current location to the label. This is usually equivalent to placing the label in the label field of the following machine instruction or assembler directive. However, when a statement consisting of only a label is preceded by a TEXT or BYTE directive and is followed by a DATA directive or a machine instruction, the label does not have the same value as a label in the following statement unless the TEXT or BYTE directive left the location counter on an even (word) location. An EVEN directive following the TEXT or BYTE directive prevents this problem."

Matthew
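The label rule quoted from the E/A manual can be modeled with a toy location counter. This Python sketch is only a model of the behavior shown in the listings above, not how either assembler is actually written; the `even_bare_labels` flag is a made-up switch that imitates the Asm994a behavior reported in this thread.

```python
def assign_labels(statements, even_bare_labels=False):
    """Walk (label, kind, size) statements and return {label: address}.

    kind is "byte" (BYTE/TEXT data, no alignment), "code" (machine instruction
    or DATA, forced to an even address) or "label" (a label with nothing
    that generates code). even_bare_labels=True imitates the Asm994a behavior.
    """
    symbols = {}
    loc = 0
    for label, kind, size in statements:
        if kind == "code" or (kind == "label" and even_bare_labels):
            loc += loc & 1             # round up to the next even address
        if label:
            symbols[label] = loc
        loc += size
    return symbols

program = [
    ("VAR1", "byte", 1),     # VAR1 BYTE 1
    ("MAIN", "label", 0),    # MAIN on its own line
    (None,   "code", 4),     # LI   R0,ND-ST
    (None,   "code", 2),     # JMP  $
    ("ST",   "byte", 11),    # ST   TEXT 'HELLO WORLD'
    ("ND",   "label", 0),    # ND   END MAIN (END is not a machine instruction)
]

ti = assign_labels(program)
forced = assign_labels(program, even_bare_labels=True)
print(ti["MAIN"], hex(ti["ND"]), ti["ND"] - ti["ST"])              # 1 0x13 11
print(forced["MAIN"], hex(forced["ND"]), forced["ND"] - forced["ST"])  # 2 0x14 12
```

The first line matches the E/A listing (MAIN at 0001, ND at 0013, length 000B); the second reproduces the off-by-one length of 0C that started this whole discussion.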

Willsy · March 23, 2010

Looks like TI did that because of the memory based registers.

So even if you load a byte, it's not in the correct position to use it for a loop counter without shifting? That's just wrong!

You are correct, loading a byte from a variable will load into the MSB of a register, so to use it as a loop counter you have to either clear the register, load the byte, and SWPB - or clear the register and load to the LSB with the trick I showed in a previous post. The other option is to use 2-byte variables, which is a waste sometimes, but perfectly acceptable if you are not going to be pushing memory limits. It is a pain in the ass though for sure.

Sugar coat it why don't you?

I'm thinking 2 byte variables is the way to go. You'd waste more memory on longer code than you'd save on the variables. Except in the case of dealing with a lot of data through 1 routine.

So, are TI BASIC tokens a BYTE or a WORD?

The Wiki says the 9900 only has a 15 bit address bus... so a max of 32K without bank switching I'm assuming?

So that explains why they never shipped the machine with over 16K RAM and when other computers jumped to 64K built in RAM they didn't follow. So the 32K RAM expansions must bank out the ROMs and you can't use it from BASIC?

Well, the 9900 has a 16-bit data bus and always accesses memory on an even address, so that would be 32K "words", which is 64K "bytes". The 16K that came with the computer was actually the dedicated VDP (9918A) memory! The only "real" 16-bit RAM in the machine is what is called the "scratch pad" memory at >8300 to >83FF. The 32K RAM expansion in the PEB is actually multiplexed to an 8-bit bus! Everything in the machine is pretty much on the multiplexed 8-bit bus except the scratch pad RAM and the console ROM, which makes the speed of the 16-bit CPU pretty much useless and not much of an advantage. The multiplexer adds 4 wait states for every memory access, and that on top of the read-before-write nature of the 9900 makes the system as a whole very slow.

Well, 32K words isn't any better than 32K bytes in this case because instructions are always made up of 16 bit words.

Look up evil in the dictionary and you'll find a photo of the engineers that worked on the TI-99/4.

So does Geneve eliminate the multiplexing?

No! It's 64K - it's just that TI didn't bother with the least significant bit, which makes perfect sense, since it's a 16 bit chip and only accesses memory on even byte boundaries.

The 9995 does have it though ;-)

JamesD · March 24, 2010

Also note that even if the ND label is followed by an instruction which must be aligned, it still gets the correct address. So, the original posted code should have worked, at least when calculating the length. The last thing I want to point out is that this functionality is by design. The E/A manual states on page 47:

"A source statement consisting of only a label field is a valid statement. It has the effect of assigning the current location to the label. This is usually equivalent to placing the label in the label field of the following machine instruction or assembler directive. However, when a statement consisting of only a label is preceded by a TEXT or BYTE directive and is followed by a DATA directive or a machine instruction, the label does not have the same value as a label in the following statement unless the TEXT or BYTE directive left the location counter on an even (word) location. An EVEN directive following the TEXT or BYTE directive prevents this problem."

Matthew

So if I'm following that correctly... my code should have worked but if you try to use the label that immediately follows the string to access code that follows, you might have the wrong address for the code unless you use EVEN or BYTE before the label and that would give the wrong string length for odd length strings.

If you are doing what I suggested you can't reuse the label for code but it *should* work with multiple strings placed back to back since those can be accessed on odd addresses.

You also couldn't just throw in a 2nd label for code unless you separate the two with EVEN or the 2nd label would still have the same address as the first label.

That is the behavior I would expect.

matthew180 · March 25, 2010

So if I'm following that correctly... my code should have worked but if you try to use the label that immediately follows the string to access code that follows, you might have the wrong address for the code unless you use EVEN or BYTE before the label and that would give the wrong string length for odd length strings.

Correct. In the original code, the trailing label was intended to be used to let the assembler calculate the length of the string, and Asm994a was incorrectly forcing the label to an even address. I guess I would not try to use a label to represent the end of a string *and* the location of code.

So, any time a label follows a TEXT or BYTE directive, and that label is on a line by itself (or another TEXT or BYTE directive) the address could be odd or even, no matter what follows on the next line.

      BYTE 1
LBL1
LBL2   LI R0,1

LBL1 and LBL2 are not necessarily the same, but they *could* be. If they need to be, then an EVEN directive should precede the LBL1 label.

Matthew

JamesD · March 25, 2010

In the following code, ST2 *should* point to the byte after the string but it depends on the assembler. Same for NEXT. If the string is an odd length like this example, NEXT may not point to the code on the same line depending on the assembler you use. I'd want the behavior tested and documented before I'd say.

      REF   VMBW           ; External function VMBW
SFIRST LI    R0,1           ; VDP target address
      LI    R2,ST2-ST      ; Number of bytes to write
      LI    R1,ST          ; Source address in RAM
      BLWP  @VMBW          ; VMBW = Multiple Byte Write
      JMP   $              ; soft-halt ($ means program counter)
ST     TEXT  'HELLO WORLD'
ST2
NEXT   LI    R0,2           ; just here for the example

      END   SFIRST

The fix *should* be to insert the EVEN yourself.

Strings can be back to back without it but anytime code follows strings, you *should* use the EVEN. It's the only way to be sure NEXT is correct. I'd still worry about an assembler inserting an EVEN before ST2 until it has been tested.

      REF   VMBW           ; External function VMBW
SFIRST LI    R0,1           ; VDP target address
      LI    R2,ST2-ST      ; Number of bytes to write
      LI    R1,ST          ; Source address in RAM
      BLWP  @VMBW          ; VMBW = Multiple Byte Write
      JMP   $              ; soft-halt ($ means program counter)
ST     TEXT  'HELLO WORLD'
ST2
      EVEN
NEXT   LI    R0,2           ; just here for the example

      END   SFIRST

matthew180 · March 25, 2010

That is already tested. "NEXT" will be on an even address and point to the code because both assemblers will ensure that a machine instruction is on an even address. However, in Asm994a, the "ST2" label would also, incorrectly, be on an even address (the same address as NEXT.) Using the original TI assembler, your labels are as you would expect, i.e. ST2 is odd and points to the end of the string and NEXT points to the instruction.
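To make the difference concrete, here is a rough Python model of the location counter for the HELLO WORLD example. It only mimics the two behaviors described above; it is not how either assembler is implemented. The function name `assign_addresses`, the `align_bare_labels` flag, and the >A000 start address are made up for illustration.

```python
# Toy model of how a label gets its address; not real assembler code.
# Each source line is (label, kind, size): kind is "text" (byte data),
# "code" (a machine instruction, always word aligned) or "none" (bare label).

def assign_addresses(lines, start=0xA000, align_bare_labels=False):
    """Return {label: address}.

    align_bare_labels=False models the behavior described for the original
    TI assembler; True models the Asm994a behavior described in the post.
    """
    pc = start
    labels = {}
    for label, kind, size in lines:
        # Instructions are always placed on an even address.
        if kind == "code" and pc % 2:
            pc += 1
        # The quirk: a label alone on a line gets rounded up as well.
        if kind == "none" and align_bare_labels and pc % 2:
            pc += 1
        if label:
            labels[label] = pc
        pc += size
    return labels

program = [
    ("ST",   "text", len("HELLO WORLD")),  # 11 bytes, odd length
    ("ST2",  "none", 0),                   # label on a line by itself
    ("NEXT", "code", 4),                   # LI R0,2 is a two-word instruction
]

ti = assign_addresses(program)
asm994a = assign_addresses(program, align_bare_labels=True)

print(ti["ST2"] - ti["ST"])            # 11, the real string length
print(asm994a["ST2"] - asm994a["ST"])  # 12, one byte too many
print(ti["NEXT"] == asm994a["NEXT"])   # True, NEXT is even either way
```

So under this model the only thing that changes between the two assemblers is the value of ST2-ST, which is exactly the byte count handed to VMBW.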

Matthew

JamesD · March 27, 2010

No! It's 64K - it's just that TI didn't bother with the least significant bit, which makes perfect sense, since it's a 16 bit chip and only accesses memory on even byte boundaries.

The 9995 does have it though

So, does loading from 8 bit RAM load the MSB and there is no LSB at that WORD?

8 bit RAM would be at every even address? (has to be even... no address line to be odd)

<edit> I suppose that depends on the endian nature of the 9900.

That would explain how the one RAM expansion I saw worked. It could be 8 or 16 bit at the flip of a switch.

If you didn't mind half the RAM just going away it would be very easy to duplicate.

I'm guessing some address decoding to see if it's an 8 bit section of RAM and the switch is to 8 bit mode combined with the chip select signal.

Edited March 27, 2010 by JamesD

matthew180 · March 27, 2010

So, does loading from 8 bit RAM load the MSB and there is no LSB at that WORD?

8 bit RAM would be at every even address? (has to be even... no address line to be odd)

<edit> I suppose that depends on the endian nature of the 9900.

That would explain how the one RAM expansion I saw worked. It could be 8 or 16 bit at the flip of a switch.

If you didn't mind half the RAM just going away it would be very easy to duplicate.

I'm guessing some address decoding to see if it's an 8 bit section of RAM and the switch is to 8 bit mode combined with the chip select signal.

I'm not sure what "switch" you are referring to, but the 32K RAM expansion is 8-bit only.

A couple of things to clarify. RAM is pretty much universally measured in bytes, it does not matter how the CPU accesses it, the measurement is bytes. Modern 32 and 64 bit CPUs read chunks 4 and 8 bytes at a time, yet they can still deal with the individual bytes. To that end, the TMS9900 can address 64K bytes. However, it can only request bytes from memory two at a time, and the requested address will therefore always be even, no matter which byte you are working with.
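The arithmetic behind that, as a small Python sketch. The constant and function names are just for illustration; the only facts used are the 15 address lines and the 16-bit data bus mentioned in this thread.

```python
ADDRESS_LINES = 15    # the 9900 has no address line for the odd/even byte
BYTES_PER_WORD = 2    # every bus access moves a 16-bit word

words = 2 ** ADDRESS_LINES
total_bytes = words * BYTES_PER_WORD

def word_address(byte_address):
    """Even address that ends up on the bus for a given byte address."""
    return byte_address & 0xFFFE

print(words)                       # 32768
print(total_bytes)                 # 65536
print(hex(word_address(0x8301)))   # 0x8300
print(hex(word_address(0x8300)))   # 0x8300
```

Both >8300 and >8301 land on the same word; which half gets used is decided inside the CPU.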

Where the addressed byte goes totally depends on the instruction you are dealing with, and the destination specified. For example, the MOVB (move byte) instruction can move a single byte between any two memory addresses, between memory and a register, or from register to register. When a register is the source or destination, the MSB (the byte at the even address, since the 9900 is big-endian) is the byte moved. When dealing with memory, it will always move the byte addressed. Some examples:

      MOVB R0,@>8301     // Move the MSB of R0 to memory address >8301
      SWPB R0            // Swap the MSB and LSB of R0
      MOVB R0,@>8303     // Move the MSB (what used to be the LSB) of R0 to >8303

      MOVB @>8300,@>8301 // Move the byte at address >8300 to address >8301
      MOVB R0,R1         // Move the MSB of R0 to the MSB of R1

      // Copy the lower 32K to the upper 32K 1-byte at a time
      CLR  R0            // Set R0 to zero, source address >0000
      LI   R1,>8000      // Destination address >8000 (32K address)
      LI   R2,>8000      // Number of bytes to copy (32K)
LP1    MOVB *R0+,*R1+     // Move a byte and increment the pointers in R0 and R1 by *ONE*
      DEC  R2
      JNE  LP1

      // Copy the lower 32K to the upper 32K with 2-bytes at a time
      CLR  R0
      LI   R1,>8000      // Destination address
      LI   R2,>4000      // Copying 2 bytes at a time
LP2    MOV  *R0+,*R1+     // Copies 2 bytes and increments R0 and R1 by *TWO*
      DEC  R2
      JNE  LP2

For the auto-increment operations the CPU knows that it is doing a byte or word move and increments the pointer values accordingly. Anyway, that's the basics. The CPU is capable of working with any byte in the 64K address range, but depending on what you are doing you can sometimes get better performance by taking advantage of the 16-bit nature of the CPU. This is not particular to the 9900; x86 CPUs have similar ways of doing things.
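If it helps to see the register side of MOVB spelled out, here is a small Python model of the first three example lines, plus the clear / load / SWPB loop-counter idiom mentioned earlier in the thread. The helper names are invented for this sketch, and memory is just a flat byte array, which ignores how the bus actually fetches words.

```python
def movb_reg_to_mem(reg, memory, address):
    """MOVB Rn,@address: store the MSB of the register at the byte address."""
    memory[address] = (reg >> 8) & 0xFF

def movb_mem_to_reg(memory, address, reg):
    """MOVB @address,Rn: load a byte into the MSB, leave the LSB unchanged."""
    return (memory[address] << 8) | (reg & 0x00FF)

def swpb(reg):
    """SWPB Rn: swap the two bytes of a 16-bit register value."""
    return ((reg << 8) | (reg >> 8)) & 0xFFFF

memory = bytearray(0x10000)
r0 = 0x1234

movb_reg_to_mem(r0, memory, 0x8301)   # >8301 gets >12 (the MSB)
r0 = swpb(r0)                         # R0 is now >3412
movb_reg_to_mem(r0, memory, 0x8303)   # >8303 gets >34 (the old LSB)

# Using a byte variable as a loop counter: clear, load into the MSB, swap.
r1 = 0
r1 = movb_mem_to_reg(memory, 0x8301, r1)   # R1 = >1200
r1 = swpb(r1)                              # R1 = >0012

print(hex(memory[0x8301]), hex(memory[0x8303]), hex(r1))  # 0x12 0x34 0x12
```

Without the swap the counter would be >1200 instead of >0012, which is the whole annoyance with 1-byte variables.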

Matthew

JamesD · March 28, 2010

So, does loading from 8 bit RAM load the MSB and there is no LSB at that WORD?

8 bit RAM would be at every even address? (has to be even... no address line to be odd)

<edit> I suppose that depends on the endian nature of the 9900.

That would explain how the one RAM expansion I saw worked. It could be 8 or 16 bit at the flip of a switch.

If you didn't mind half the RAM just going away it would be very easy to duplicate.

I'm guessing some address decoding to see if it's an 8 bit section of RAM and the switch is to 8 bit mode combined with the chip select signal.

I'm not sure what "switch" you are referring to, but the 32K RAM expansion is 8-bit only.

http://www.mainbyte.com/ti99/32K16/32k16.html

A couple of things to clarify. RAM is pretty much universally measured in bytes, it does not matter how the CPU accesses it, the measurement is bytes. Modern 32 and 64 bit CPUs read chunks 4 and 8 bytes at a time, yet they can still deal with the individual bytes. To that end, the TMS9900 can address 64K bytes. However, it can only request bytes from memory two at a time, and the requested address will therefore always be even, no matter which byte you are working with.

:roll: Um, just an FYI I was a Computer Science major and also took a couple years worth of EE classes. I'm pretty familiar with what a byte is.

I did a little reading. Part of this is fact and some is an educated guess.

If I followed the info correctly, the 9900 is missing A15, but technically it's missing the bit that would normally select the odd or even BYTE since the CPU has a 16 bit data buss. FWIW, on most CPUs that's A0.

The 9900 ALWAYS expects a 16 bit data buss and only addresses 32 x 1024 memory locations. On a 16 bit buss that is 64K but that's only 32K on an 8 bit buss and the CPU has no built in control logic to deal with an 8 bit buss.

To get around this there is interface hardware between the CPU and RAM that inserts wait states while it multiplexes the buss to load two consecutive BYTEs from the 8 bit buss to form a 16 bit word the CPU normally sees.

External logic tells the multiplexer whether an address is 8 or 16 bit.

The RAM expansion linked to above replaces the two 128 x 8 bit scratchpad RAMs on the 16 bit buss with two larger (32K x 8 bit?) RAMs. At the very least that provides access to the 16 bit data buss, many of the address lines and some control logic. Then the other interface & board mods alter the memory mapping hardware that tells the multiplexer what addresses are 8 or 16 bit and maps the new RAM based on switch settings.

The 8/16 bit mode switch must trigger several things.

First, it tells the multiplexer if the RAM address being accessed is 8 or 16 bit so it goes into the desired mode.

<edit> technically it tells the PAL to tell the multiplexer.

Second, it changes how the RAMs respond. The extra address line (chip select, whatever) from the multiplexer is ignored if the mode is 16 bit and both chips trigger at once to form the two halves of a 16 bit word on the data buss.

However, if the mode is 8 bit, the PAL uses this new address line to determine which chip responds on the data buss.

8 bit mode also tells the PAL to tie the data buss from the "odd" address chip to the same data lines as the other chip when it is selected. Data is actually output on the high bits and low bits of the 16 bit bus but the multiplexer ignores the high byte because it's only expecting 8 bits instead of 16.

The high bits on the 16 bit buss are tristated when in 8 bit mode or they would have to pass through the PAL so they could be disabled and I don't see that many lines to the PAL.

<edit> I'm talking about a write here

The other switches just modify the memory mapping.

Quite clever actually and I mean in a good way.

Where the addressed byte goes totally depends on the instruction you are dealing with, and the destination specified...

You are totally thinking software... I'm thinking about what's going on at the hardware level. To the program it's transparent except for the wait states and some differences in the memory map.

Edited March 28, 2010 by JamesD

Tursi · March 29, 2010

If I followed the info correctly, the 9900 is missing A15, but technically it's missing the bit that would normally select the odd or even BYTE since the CPU has a 16 bit data buss. FWIW, on most CPUs that's A0.

More or less correct. TI numbered the CPU bits backwards, so on the 9900 the "missing" bit (the LSB) is A15.

To get around this there is interface hardware between the CPU and RAM that inserts wait states while it multiplexes the buss to load two consecutive BYTEs from the 8 bit buss to form a 16 bit word the CPU normally sees.
External logic tells the multiplexer whether an address is 8 or 16 bit.

Close enough for government work.

The RAM expansion linked to above replaces the two 128 x 8 bit scratchpad RAMs on the 16 bit buss with two larger (32K x 8 bit?) RAMs. At the very least that provides access to the 16 bit data buss, many of the address lines and some control logic. Then the other interface & board mods alter the memory mapping hardware that tells the multiplexer what addresses are 8 or 16 bit and maps the new RAM based on switch settings.

(edit) My mistake, I didn't read the mod description well enough. This version DOES replace the onboard RAM chips. This isn't too big a change from the discrete logic version - the PAL would make it easier. The new 32k RAM chips provide enough RAM for the entire memory space, it's just when you enable it that counts.

The other changes change the circuitry that enables/disables the multiplexer and wait state generator, yes.

The 8/16 bit mode switch must trigger several things.
First, it tells the multiplexer if the RAM address being accessed is 8 or 16 bit so it goes into the desired mode.

<edit> technically it tells the PAL to tell the multiplexer.

Second, it changes how the RAMs respond. The extra address line (chip select, whatever) from the multiplexer is ignored if the mode is 16 bit and both chips trigger at once to form the two halves of a 16 bit word on the data buss.

No. All the switch does is restore the original performance of the machine by optionally allowing the wait state generator to still respond to 32k memory accesses. The memory access remains a 16-bit access after doing the modification, the machine just waits the usual amount of time before continuing. It has to, because the multiplexer circuit can't get access to the full memory on the 16-bit bus.

However, if the mode is 8 bit, the PAL uses this new address line to determine which chip responds on the data buss.

As far as I know, the guts of that modification are the same as the discrete logic version, which Mainbyte also covers here:

http://www.mainbyte.com/ti99/16bit32k/32kconsole.html

Part 1 doesn't cover the speed switch, but I posted a detail of how this mod works here:

http://harmlesslion.com/text/128k%20On%2016%20Bit%20Bus.pdf

(disclaimer - I need to update this - the 128k is not stable and I know why, just haven't updated the doc. The 32k description is solid.)

You are totally thinking software... I'm thinking about what's going on at the hardware level. To the program it's transparent except for the wait states and some differences in the memory map.

Byte access on the 9900 is always a word access. The processor reads the entire word, and modifies the requested byte internally, then writes the whole word back out. The CPU itself has no concept of the multiplexer circuitry or even any need of it, it just knows SOMETHING keeps telling it to slow down!
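A rough Python model of that read-modify-write, for anyone who wants to see it step by step. `WordBus` and `store_byte` are names invented for this sketch; the model only counts bus accesses and says nothing about wait states or timing.

```python
class WordBus:
    """Memory that can only be read or written one 16-bit word at a time."""

    def __init__(self, size_bytes=0x10000):
        self.words = [0] * (size_bytes // 2)
        self.reads = 0
        self.writes = 0

    def read_word(self, address):
        self.reads += 1
        return self.words[(address & 0xFFFE) >> 1]

    def write_word(self, address, value):
        self.writes += 1
        self.words[(address & 0xFFFE) >> 1] = value & 0xFFFF


def store_byte(bus, address, value):
    """Write one byte the way the post describes: read word, patch, write word."""
    word = bus.read_word(address)
    if address & 1:
        word = (word & 0xFF00) | (value & 0xFF)          # odd address = LSB
    else:
        word = (word & 0x00FF) | ((value & 0xFF) << 8)   # even address = MSB
    bus.write_word(address, word)


bus = WordBus()
bus.write_word(0x8300, 0xAABB)
bus.reads = bus.writes = 0          # start counting from here

store_byte(bus, 0x8301, 0x12)
print(bus.reads, bus.writes)        # 1 1
print(hex(bus.read_word(0x8300)))   # 0xaa12
```

One byte store costs a full word read and a full word write, and the neighbouring byte (>AA here) comes through untouched.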

Edited March 29, 2010 by Tursi

JamesD · March 30, 2010

After looking at the pics of the motherboard it's obvious I was over thinking what TI did. I was expecting a custom chip and found something more like a simple 16 bit IDE to 8 bit CPU interface. What I suggested with the RAM upgrade was way more complex and didn't make much sense since the buss doesn't go through a special chip that does both the multiplexing and wait states. Not sure why it takes so many wait states but then I'm not familiar with the 9900.

BTW, on the 128K upgrade... I'd try modding it to bank out the ROMs leaving RAM in its place. Then bank the upper portion of RAM as well. That might prove most useful. That would leave 3 RAM banks in the RAM area and 1 in the ROM area.

So how much difference in speed is there on a machine with the 16 bit RAM upgrade?

I rewrote the 6803 music player to be fully interrupt driven. It bumped up the size to about 130 bytes but I haven't really tried to optimize for size. A lot of that is setup, start, stop, cleanup. The actual interrupt handler is 43 bytes and the inner loop is 4 instructions totaling 18 clock cycles. If Motorola had optimized the instructions it should have taken almost half that number of clock cycles. I can see why some of the VHDL/Verilog CPU cores say they use half the number of clock cycles.

One thing about the 9900, its register setup would probably work well with how GCC works if someone wanted to do a port.

Tursi · March 30, 2010

After looking at the pics of the motherboard it's obvious I was over thinking what TI did. I was expecting a custom chip and found something more like a simple 16 bit IDE to 8 bit CPU interface. What I suggested with the RAM upgrade was way more complex and didn't make much sense since the buss doesn't go through a special chip that does both the multiplexing and wait states. Not sure why it takes so many wait states but then I'm not familiar with the 9900.

Have a read through http://nouspikel.group.shef.ac.uk/ti99/titechpages.htm - nobody has documented the hardware of the machine better. He has a fantastic description of the wait state generator, including what happens when you disable each of the waits and a way to run the entire system with NO wait states at all.

BTW, on the 128K upgrade... I'd try modding it to bank out the ROMs leaving RAM in its place. Then bank the upper portion of RAM as well. That might prove most useful. That would leave 3 RAM banks in the RAM area and 1 in the ROM area.

There are plenty of ways to do it. However, banking out the ROMs is a much bigger task. All I did was attach spare RAM pins to spare IO pins. It was never meant to be a useful mod and my document attempts to emphasize this.

The reason it's not stable is that 'high' on the 9901 is only 2V, which is not quite high enough to be reliable on the RAM chips. I recently disabled that mod in my machine so my chips just run as normal 32k.

So how much difference in speed is there on a machine with the 16 bit RAM upgrade?

The speed boost is usually considered to be about 50% on average, since all access to external devices, sound, and VDP still trigger wait states.

One thing about the 9900, its register setup would probably work well with how GCC works if someone wanted to do a port.

I started it, based on the PDP-11 port, and had some success, but got busy with other projects and left it half-finished: http://harmlesslion.com/hl4m/viewtopic.php?f=1&t=324

My problem is I really didn't follow a lot of what GCC was doing. The point at which I left it I knew what the next issue was, but I had to go back and redo the conditionals, because I'd backed myself into a corner. To pick it up again I'd probably be mostly starting over, since I don't remember much of what I learned.

Edited March 30, 2010 by Tursi
