F18A

Willsy · April 15, 2012

32 deep would be fine - more than enough I would say. Who is going to nest 16 levels deep? ;-)

Am I right in thinking the GPU has a fixed workspace; or to put it another way, there are no "workspaces" as such (no LWPI instruction)? If so, then a data stack would be useful. Implementation probably isn't straightforward though. I don't envy you there.

How do you find room in the instruction formats for things like:

mov r0,stack

mov stack,r0

mov *r0,stack

mov stack,*r0

etc etc

IIRC, the opcode formats are quite densely packed on the 9900 - there's no room for a 17th register.

That would mean treating the stack as a memory-mapped device. The advantage of that is that standard 9900 assemblers could be used, as the stack is just a memory address; the magic happens in the GPU which traps the memory address access, does a push or a pop according to the instruction being executed.

E.g. instead of the above, you have:

mov r0,@stack

mov @stack,r0

mov *r0+,@stack

mov @stack,*r0+

As you can see, these are just bog standard 9900 instructions.

Just musing out loud here...!

Tursi · April 15, 2012

The only thing that really hampers using one of the 9900's existing registers as a stack pointer (something that some other uPs do!) is the lack of a pre-decrement mode. With post-increment and pre-decrement, you have everything you need for single-instruction stack access.
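
To make the missing piece concrete, here is a rough sketch of a downward-growing word stack using only stock 9900 instructions. R10 as the stack pointer and the STKTOP address are arbitrary choices for illustration, not anything defined for the F18A. The pop is a single instruction thanks to post-increment; the push needs a separate DECT because there is no pre-decrement mode.

```asm
STKTOP EQU  >4800            * arbitrary example: first address past the stack area
       LI   R10,STKTOP       * empty stack: pointer sits just past the top
       DECT R10              * push, step 1: make room for one word
       MOV  R1,*R10          * push, step 2: store R1
       MOV  *R10+,R1         * pop: read the word, then step back up
```

Note the DECT (decrement by two): a word push or pop has to move the pointer by two bytes.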

Willsy · April 15, 2012

That's a very good point indeed. Whether or not it is possible to represent pre-decrement in the various op-code formats remains to be seen however - I suspect there are no bits available to represent pre-decrement mode. Maybe something could be done whereby pre-decrement operations are restricted to registers 0 to 7? In fact, if it was restricted to a single register I'd be fine with that too ;-)

Again though, since you would have to steal a bit from somewhere to indicate pre-decrement that might restrict you to working with 8 source and/or destination registers.

jstimson · April 15, 2012

I have one of these on pre-order and am eagerly awaiting it, however am more than content to wait for the product to be finalized when Matthew is content with it.

I am (or used to be) a more than competent XB programmer but never managed to delve into the assembly area of the TI and, with current demands on my time, probably won't be able to do that until I retire in a long time.

For all of these extra features being placed into the unit, how many of them (if any) will be accessible from Basic or XB? Even if it's through CALL LOAD and CALL PEEK that would be cool. Although I'm primarily looking at the F18A as a display device, being able to access the extra features without deep code would be a huge bonus.

matthew180 · April 15, 2012

I have one of these on pre-order and am eagerly awaiting it, however am more than content to wait for the product to be finalized when Matthew is content with it.

And I appreciate that very much! I am working on getting it done as fast as possible.

I am (or used to be) a more than competent XB programmer but never managed to delve into the assembly area of the TI and, with current demands on my time, probably won't be able to do that until I retire in a long time.

I'm going to toss out my standard response here and say assembly is not as hard as people think. I think the "idea" that assembly is hard keeps more people away from it, rather than people actually trying and failing at learning assembly itself. Also, Mark Wills has done an awesome job at making Forth and making it available, which is a good intermediate step with better speed than XB and none of the baggage.

For all of these extra features being placed into the unit, how many of them (if any) will be accessible from Basic or XB? Even if it's through CALL LOAD and CALL PEEK that would be cool. Although I'm primarily looking at the F18A as a display device, being able to access the extra features without deep code would be a huge bonus.

Well, probably none of the ECMs (enhanced color modes) because B/XB totally own the VRAM and use it for everything. The scrolling support could be activated though, since that only requires access to the enhanced VDP Registers. The 32 sprites at a time works fine and is available without having to do anything. Also, the GPU itself could still be utilized since it has its own dedicated 2K block of RAM. The 32-bit counter and random number generator would be usable, as well as the programmable palette registers.

Once I'm done with the board and have them delivered, I'll spend some time making an F18A "CALL LOAD" interface function for XB (unless someone beats me to it).

matthew180 · April 15, 2012

Am I right in thinking the GPU has a fixed workspace; or to put it another way, there are no "workspaces" as such (no LWPI instruction)? If so, then a data stack would be useful.

Correct, no workspace pointer. The registers are based on a fixed "register file" that is not part of the RAM address space.

How do you find room in the instruction formats for things like:

mov r0,stack

mov stack,r0

mov *r0,stack

mov stack,*r0

I don't think I would go that route. Probably just something like:

PUSH Rn

POP Rn

IIRC, the opcode formats are quite densely packed on the 9900 - there's no room for a 17th register.

Yes they are. There are 4 bits allocated in most opcodes to specify a register, and 2 bits to indicate the addressing "mode" (register, indirect, symbolic/indexed, or indirect auto-increment). But, like above, I would probably implement instructions that deal with the stack rather than treating it as another register.
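
As a worked illustration of how full the two-operand format is, here is one instruction broken into its fields. This is hand-assembly for illustration, so check it against the 9900 data manual before relying on it. All four values of each 2-bit mode field are already spoken for, so there is no spare encoding for a pre-decrement mode.

```asm
       MOV  *R0+,R1          * assembles to >C070
*
*      >C070 = 110 0 00 0001 11 0000
*              |   | |  |    |  +--- S  = 0000  source register R0
*              |   | |  |    +------ Ts = 11    indirect, post-increment
*              |   | |  +----------- D  = 0001  destination register R1
*              |   | +-------------- Td = 00    register
*              |   +---------------- B  = 0     word operation
*              +-------------------- opcode 110 = MOV
```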

matthew180 · April 15, 2012

The only thing that really hampers using one of the 9900's existing registers as a stack pointer (something that some other uPs do!) is the lack of a pre-decrement mode. With post-increment and pre-decrement, you have everything you need for single-instruction stack access.

Well, there is also the issue of the B instruction's little quirk that makes this sequence *not* work (R12 would hold an address of the stack):

MOV R11,*R12+

. . .

DECT R12

B *R12

That is why I was thinking about a CALL and RET instruction that would perform:

CALL = PC -> stack, stack pointer +1, src -> PC

RET = stack pointer -1, stack -> PC

PUSH = Rn -> stack, stack pointer +1

POP = stack pointer -1, stack -> Rn

Although I'm not sure if I can dig up four more unused opcodes that have the correct format, but I think I could. The next question is, would these instructions be worth having, or just *nice to have* ?

Edited April 15, 2012 by matthew180

Stuart · April 15, 2012

Trivia: Some stack support was added to the TMS99000 (with an extra '0' ) family in the form of two instructions:

BLSK - Branch immediate and push link to stack.

Opcode 0000 0000 1011 XXXX

1st word of instruction is opcode + register number

2nd word of instruction is immediate operand (IOP)

Function:

(W) - 2 --> (W)

(PC) + 4 --> ((W))

IOP --> (PC)

So the stack pointer is in a register and the stack fills from the top downwards?

BIND - Branch indirect

Opcode 0000 0001 01XX XXXX

1st word of instruction is opcode +Ts + S fields

Function:

(SA) --> (PC)

"The BIND instruction serves as the inverse of a BLSK instruction if the register indirect autoincrementing addressing mode is used."
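
Going only by the opcode fields listed above, the pair could be hand-assembled with DATA on an assembler that does not know the mnemonics. This is an untested sketch for a TMS99000-family part, not the F18A GPU: R10 as the stack register and the SUBR label are arbitrary, and the encodings are simply arithmetic on those bit patterns.

```asm
       DATA >00BA            * BLSK R10,SUBR: >00B0 + register 10
       DATA SUBR             * second word: immediate operand -> PC
*      (execution resumes here after the return)
*
SUBR   CLR  R0               * placeholder subroutine body
       DATA >017A            * BIND *R10+: >0140 + Ts=11 (>30) + S=10 (>0A)
```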

S.

matthew180 · April 15, 2012

Trivia: Some stack support was added to the TMS99000 (with an extra '0' ) family in the form of two instructions:

BLSK - Branch immediate and push link to stack.

BIND - Branch indirect

Wow, that is interesting trivia! I'll have to go see where those opcodes fit in with the 9900's instruction set. Thanks for the info.

Tursi · April 15, 2012

Well, there is also the issue of the B instruction's little quirk that makes this sequence *not* work (R12 would hold an address of the stack):

MOV R11,*R12+

. . .

DECT R12

B *R12

Yeah, but you wouldn't do it that way. Look how your push and your pop operations are different.

MOV R11,*R12+ * PUSH

...

DECT R12

MOV *R12,R11 * POP

B *R11

The predecrement combines the DEC and the MOV, not the MOV and the B.

The main issue with this idea, or new instructions, is whether the assembler toolchains support them. Asm994A doesn't and can't be extended, and I think most Windows users use that for cross assembling. There's the command-line toolchain which I believe does have source available, so it could be extended. Other than that... it won't see wide acceptance if it has to be hand-assembled. Doesn't mean it's not worth it if it's simple of course.

matthew180 · April 15, 2012

Actually, you can use a DATA statement right where you want the new instruction. I did that with Win994a already for the CKON and CKOF instructions that I use for the SPI interface. It is a little cumbersome at first, and would only be necessary until I can get an assembler out there. I have one started already, and it could have an initial release easily enough if I trim back some of the features I had planned for now.

Here is some example code I wrote to read the SPI Flash via the GPU:

*      Load the SPI ID.
       LI   R0,>9F00          * >9F read ID command
       LI   R1,43
       LI   R2,4
       DATA >03A0             * CKON
       LDCR R0,8
LP4    STCR R0,8              * Read from SPI
       AI   R0,>2D00          * Add offset 45 to get to ASCII range
       MOVB R0,*R1+           * MMA?
       DEC  R2
       JNE  LP4
       DATA >03C0             * CKOF

matthew180 · April 15, 2012

I also just realized you can use an EQU with the opcode value, so the intermediate use becomes a little more readable:

CSON   EQU  >03A0             * SPI chip select enable
CSOFF  EQU  >03C0             * SPI chip select disable
.
.
.
       LI   R2,4
       DATA CSON              * SPI CS enable
       LDCR R0,8

This should work with any assembler, even the stock E/A.

Tursi · April 16, 2012

There you go, that's a good workaround.

If you do go for PUSH and POP (which I wouldn't mind), it would still be nice if it used a standard register, so we don't need to use workarounds to manipulate it. Maybe dedicate R15 or something. I'd also personally like to see it use standard memory, and not a dedicated memory space.

Just my (next) two bits on that!

matthew180 · April 17, 2012

Quick update to introduce a new feature:

The Bitmap Layer lives! Although broken at the moment, this *is* the initial test after a few days of hacking, and I should have it straightened out by tomorrow.

I just could not resist adding some sort of true bitmap support, so I added what I call the "bitmap layer" or BML. The main difference with the BML is, it is *not* a screen mode, it is a layer (hence the name) and can actually be active in any graphics mode!

The BML uses 2-bits per pixel, so it is limited to 4 colors from any of 16 palettes. The BML can have priority over tiles, or only show up where a tile's pixel is transparent. The BML is always behind sprites and is not affected by the scroll registers.

The BML is sizable from 0 to 255 pixels in both directions, and can be located at any pixel location. This means you can have a bitmap "window" anywhere you want, and move it around by simply changing its location x,y registers. I suppose you can think of it as a sizable-sprite, but it does not interact with the other sprites as far as collision goes, and it does not have a "transparent" color (maybe... hmm...)

Because the BML is sizable, it is also very memory efficient. For example, if you have a 48x40 BML set up, it will only require 480 bytes of VRAM! If you use the whole screen (256x192), then you will need 12,288 bytes, which is the same amount of VRAM you need for a full GM2, but you get real pixels - yet fewer colors. There are always trade-offs though, without more VRAM.
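
The arithmetic behind those numbers is width x height x 2 bits / 8, i.e. width x height / 4 bytes. A rough sketch of the same calculation in 9900 code follows; the register choices are arbitrary, and it assumes rows pack with no padding, which is not spelled out above.

```asm
       LI   R1,48            * BML width in pixels
       LI   R2,40            * BML height in pixels
       MPY  R2,R1            * 32-bit product in R1:R2, R2 = 1920 pixels
       SRL  R2,2             * 4 pixels per byte -> R2 = 480 bytes
```

For the full 256x192 screen the product is 49152 (>C000), which still fits in R2 as an unsigned word, and the shift gives 12,288.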

The GPU also has a new instruction specifically for calculating pixel addresses and can read/set a pixel from X,Y coords in a single instruction. This should make plotting *very* fast and a lot easier. The same instruction can also calculate, from X,Y coords, the byte offset for a pixel in GM2. The new instruction can do other stuff too, but I'll write about that in the future once it is all working *correctly*.

Another note: I'm almost 100% done with features. Even if I wanted to keep adding stuff, I'm at 96% utilization of the FPGA and I'm starting to worry that the compiler might not be able to route everything if I try to stuff any more in there. I'm spent anyway. :-) Except for the PUSH and POP instructions. ;-) Geez, there always seems to be one more thing!

retroclouds · April 17, 2012

Hi, I have a question on the GPU. I think I read a while ago that it has a 2K RAM area at its disposal.

Would it be possible to use the VDP RAM as well for running code (or storing data there)?

I mean you could briefly turn the screen off, do your GPU code stuff and then turn the screen back on after having restored it.

That way you'd have 18K (16K+2K) at your disposal.

Or suppose you are in text mode, you'd still have plenty of VRAM left, so you could use the remaining part for running a program or something.

Possible use? Think of a compression/decompression/zip/unzip kind of thing.

EDIT: I know such a thing is not possible on a real 9918 VDP (unless you would be using a GPL interpreter, which will not make it faster), but consider that this is an FPGA and the VRAM is simulated.

Edited April 17, 2012 by retroclouds

kl99 · April 17, 2012

Hi retroclouds!

As far as I understand you can execute code from the VRAM (16K+2K) through the GPU (which is a 9900 clone @ 25 MHz). The GPU has direct access to the VRAM. The VRAM is not a memory-mapped device for the GPU.

@BML-idea: i like it!!!! This baby has a lot of "sprite" potential! 32 Sprites in a line, Linked Sprites, 4-color sprites, hardware scrolling, and now BML! all at once if you want. wow!

Willsy · April 17, 2012

Hi, I have a question on the GPU. I think I read a while ago that it has a 2K RAM area at its disposal. Would it be possible to use the VDP RAM as well for running code (or storing data there)? I mean you could briefly turn the screen off, do your GPU code stuff and then turn the screen back on after having restored it. That way you'd have 18K (16K+2K) at your disposal. Or suppose you are in text mode, you'd still have plenty of VRAM left, so you could use the remaining part for running a program or something. Possible use? Think of a compression/decompression/zip/unzip kind of thing. EDIT: I know such a thing is not possible on a real 9918 VDP (unless you would be using a GPL interpreter, which will not make it faster), but consider that this is an FPGA and the VRAM is simulated.

IIRC, as KL99 says, the VDP memory, from the GPU's perspective, is just ordinary memory. For example, you can read/write the 'video ram' with normal MOV instructions; no need to go through a port - in fact, there isn't one!

So, for example, to clear the screen (GPU code):

       CLR  R0            ; screen address
       LI   R1,32*256     ; space character in msb
       LI   R2,32*24      ; number of characters
LOOP   MOVB R1,*R0+       ; write a space character to video ram
       DEC  R2            ; decrement counter
       JNE  LOOP          ; repeat if not finished

Matthew, am I correct in thinking there are no 16-bit word based instructions, only byte instructions? (E.g. no MOV - MOVB only etc.)

Tursi · April 17, 2012

Okay.. that's seriously cool. We'll need an update to the register documentation, I think there are a couple of things I want to try now.

Like I'm not behind on enough projects, hehe

Willsy · April 17, 2012

Yeah, this is just *bonkers* cool. There's enough here to keep me going for years!

kl99 · April 17, 2012

Matthew, am I correct in thinking there are no 16-bit word based instructions, only byte instructions? (E.g. no MOV - MOVB only etc.)

If I got it right, there are 16-bit instructions as well; they just take longer to execute because the F18A's internal interface between GPU and VPU was designed to be 8-bit only.

I am excited as well!

matthew180 · April 17, 2012

There seems to be some confusion around the GPU, so let me clarify its capabilities and features:

* Full implementation of the 9900 with all instructions *except* those that deal with the workspace pointer or CRU. These are the unimplemented instructions: BLWP, LREX, LWPI, RTWP, RSET, STST, STWP, XOP

* Fixed registers. There is no "workspace pointer" and the registers are not stored in any of the RAM, so commands like BLWP do not make sense and would not work correctly. Think of the workspace pointer as always being set to >0000 if you must.

* New, or re-purposed instructions:

IDLE - causes the GPU to go idle and wait for a trigger from the host CPU via a new register

CKON - SPI Flash chip enable to '0'

CKOF - SPI Flash chip enable to '1'

LDCR - write a byte to the SPI Flash

STCR - read a byte from the SPI Flash

SLL - shift left logical

SLC - shift left circular

XOP - now called PIX, is the new pixel command.

* The GPU has full direct access to the 16K VRAM. In addition, it has its own 2K of RAM that I call GPU-RAM or GRAM just to be confusing. :-) The GPU memory map looks like this:

VRAM 14-bit, 16K @ >0000 to >3FFF (0011 1111 1111 1111) VDP RAM
GRAM 11-bit, 2K  @ >4000 to >47FF (0100 x111 1111 1111) GPU RAM
PRAM  6-bit, 64  @ >5000 to >5x3F (0101 xxxx xx11 1111) Palette RAM, 64 locations, 12-bit each
VREG  6-bit, 64  @ >6000 to >6x3F (0110 xxxx xx11 1111) VDP Registers, 64 regs, read and write
current scanline @ >7000 to >7xxx (0111 xxxx xxxx xxxx) current scan line, 0 to 191 / 240
32-bit counter   @ >8000 to >8xx6 (1000 xxxx xxxx x110) GPU's own private counter
32-bit RNG       @ >9000 to >9xx6 (1001 xxxx xxxx x110) GPU's own private RNG
F18A version     @ >A000 to >Axxx (1010 xxxx xxxx xxxx) 1 byte

All the memory looks and works like memory, even the VDP registers. For example, to update the horizontal scroll register by 1, you could do this:

ONE  BYTE 1
.
.
.
AB @ONE,@>601B     * horz-scroll the screen by 1 pixel

* The GPU can read and write the VDP registers just like memory. Same with the palettes, except palette access is always done with a word instruction since the palettes are 12-bits each.

* VRAM and GRAM are just like 16-bit RAM in the 99/4A, and you can store code and data in either and execute from either. The only difference between VRAM and GRAM is that only the GPU can address and access the GRAM: you can't access the GRAM from the host system CPU interface, and the video generation circuits can't use it for any screen image generation. As far as the GPU is concerned, >0000 to >47FF is one linear chunk of RAM.

* Byte vs word instructions take the same time to execute except for maybe one or two 10ns-cycles. For byte ops the GPU skips writing the second byte, so I suppose that makes them imperceptibly faster. Word instructions, just like the real 9900, are limited to even addresses and will read 2 bytes from memory. If you try something like this:

MOV @>0001,R1

R1 will contain the two bytes at >0000 and >0001, NOT the bytes from >0001 and >0002. This is consistent with the real 9900. Same going the other way, if you try:

MOV R1,@>0001

You will be writing two bytes to addresses >0000 and >0001. The MSB of the register goes to the even address, again, just like the 9900.

The good thing is, most data in the video RAM will be aligned on an even address, so the word instructions do let you perform bulk memory operations faster. For example, you can clear the screen with a word op instead of a byte op, and only have to loop 384 times instead of 768 times.

       LI   R0,>2020      * two spaces, >20 == 32
       CLR  R1            * the name table is at >0000
       LI   R2,384        * 384 words == 768 bytes
LP1    MOV  R0,*R1+       * write to name table, inc address
       DEC  R2
       JNE  LP1

Even the VDP Registers can be read/written with word ops, but you better be careful when writing them, since the same "even address" rules apply.

* Most instructions *execute* in a single 10ns-cycle, meaning the time it takes to move, compare, shift, multiply (divide is not included in this list), xor, etc. However, every instruction has the same fetch and decode time, the same time to get source or target data / addresses based on the instruction's addressing parameters, etc. So, an entire fetch, decode, load, execute, store cycle totally depends on the instruction and its source / destination addressing. Just like the real 9900. On average, a complete cycle takes 140ns or so, which is where the 7-MIPS number comes from, but the internal state-machine is clocked at 100MHz.

* The GPU has its own counter and RNG, and the host CPU also has a counter and RNG.
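
A short sketch tying two of the points above to code. The palette color layout (12 bits right-justified in the word as >0RGB) is an assumption for illustration only and is not stated above; the unaligned read simply follows from the even-address rule, with arbitrary register choices.

```asm
*      Palette entry 0 lives at >5000; palettes need a word access.
       LI   R0,>0F00         * ASSUMED layout >0RGB: full red
       MOV  R0,@>5000        * write palette entry 0
*
*      Read the unaligned word at >0001/>0002 one byte at a time.
       MOVB @>0001,R1        * byte at >0001 -> MSB of R1
       SWPB R1               * park it in the LSB
       MOVB @>0002,R1        * byte at >0002 -> MSB of R1
       SWPB R1               * R1 = byte >0001 in MSB, byte >0002 in LSB
```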

Edited April 17, 2012 by matthew180

matthew180 · April 17, 2012

Okay.. that's seriously cool. We'll need an update to the register documentation, I think there are a couple of things I want to try now.

Glad you like it. :-) I'll get the updated register use doc and bit-stream posted as soon as the BML is debugged.

Like I'm not behind on enough projects, hehe

Preaching to the choir. :-) I think the only thing that would stop me from continuing to add features is, I'm out of slices on the FPGA.

Tursi · April 17, 2012

What's the difference between SLL and SLA? My understanding all this time is that there was no need for SLL since shifting left, you don't need a sign extension so it wouldn't matter?

matthew180 · April 17, 2012

Good question. I don't know. HDL offers 6 shift instructions, so I added the two that HDL had but the 9900 did not. I'll have to go look.

Edit:

Ok, I looked. And it is interesting (I guess). The VHDL SLA duplicates the LSb. So, if you have "0001" and you SLA 2, then you would end up with "0111". If you started with "0010" and you SLA 2, then you would have "1000". The Wikipedia page on "shifting" states that the VHDL implementation is unusual and not typical.

http://en.wikipedia.org/wiki/Arithmetic_shift

In reading that, I'll probably remove SLA just to eliminate confusion, and free up a few resources.
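
For contrast, the stock 9900 SLA fills the vacated bits with zeros, so the same starting value gives a different answer than the VHDL operator (R1 is an arbitrary choice):

```asm
       LI   R1,>0001
       SLA  R1,2             * 9900: R1 = >0004; VHDL sla would end in 0111
```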

Edited April 17, 2012 by matthew180

matthew180 · April 18, 2012

As promised, a working BML! I also added a transparency ability, which is why you don't see a box like in the original photo above in post #589. It works like this (assuming the BML has been enabled):

pri = priority over the tiles
trs = transparency on (1) or off (0)
PZ = bitmap pixel including zero: 00, 01, 10, 11
PX = bitmap pixel that is not zero: 01, 10, 11
BG = background color
T = tile pixel
2 = a tile color other than transparent: 1 - 15
A tile pixel is "transparent" when its color is 0 (not 1 - 15).

 BML  BML  BML  TILE  TILE  result
 pri  trs  pix  pix   colr  pixel
-----------------------------------
  0    0   PZ    1     2      T
  0    0   PZ    0     2      T
  0    0   PZ    1     0      PZ (00 is black, not transparent, no BG)
  0    0   PZ    0     0      PZ (00 is black, not transparent, no BG)

  0    1   PX    1     2      T
  0    1   PX    0     2      T
  0    1   PX    1     0      PX
  0    1   PX    0     0      PX
  0    1   00    1     0      BG (BML and tile pixels are both transparent)
  0    1   00    0     0      BG (BML and tile pixels are both transparent)

  1    0   PZ    1     2      PZ (00 is black, not transparent, no BG)
  1    0   PZ    0     2      PZ (00 is black, not transparent, no BG)
  1    0   PZ    1     0      PZ (00 is black, not transparent, no BG)
  1    0   PZ    0     0      PZ (00 is black, not transparent, no BG)

  1    1   PX    1     2      PX
  1    1   PX    0     2      PX
  1    1   00    1     2      T
  1    1   00    0     2      T
  1    1   00    1     0      BG
  1    1   00    0     0      BG

Whew! Basically, if the bitmap layer does not have transparent enabled, then a "00" color is considered black and you will never see any background color. Normal tiles work this way too, you can have a pattern pixel of '0', yet give it color. The BML priority determines which is on top, BML or tiles, and if the BML is on top and transparency is not enabled, then you will not see any tiles behind the BML window (as in the first photo I posted).

If the BML is on top (has priority) and transparency is enabled, then anywhere in the bitmap there is a "00" pixel, you will see any tiles, and if there are no tiles, then you see background. If there are tiles, then seeing background depends on the tile's color settings. Clear as mud.
