
3D demo!


Krool885


Hey everyone!

 

It's been a while; you may remember me and my friend Robyn's snake game from back in March, maybe? Well, we've been (not very) hard at work on another project: a 3D game! Our intention is to make a sibling of the 1984 space sim Elite, given that the TI-99 never received a port, or really anything similar.

 

Both of us are in full-time education and quite busy/lazy depending on the day, so progress is slow, but we've finally wrapped up the majority of the 3D code, so we thought we'd put together a little demo for you. We will include the source code this time, since all we've done is write common algorithms and the like in TMS9900 assembly language, but going forward, when we start making unique game/creative content, we might not. We will see. The demo itself is just a standard cartridge, and works fine in Classic99 and on real hardware.

 

Anyway! If any seasoned TMS9900 programmers see this, then please give us tips/optimizations etc. You guys are gonna be the ones playing this at the end of the day, so it's in your best interest to help us improve the frame rate. We're also more than happy to answer technical questions etc. Hope you find it cool, even if it's not much, and again, we want feedback!

 

Lily & Robyn

3D.asm 3D.bin


10 minutes ago, brain said:

Anyone with a working setup and recording/streaming capability who can share a YouTube link for those of us interested but with a setup currently in pieces awaiting fixes?

The hang around 30s or so is my computer, not the demo.

 

 

 


Some optimizations for 9900:


LI  R0,0    2 words; also compares the result to 0 (sets status bits)
* CLR R0   1 word; same result but does not compare to 0

LI R0,-1   "
* SETO R0  "

To toggle flags, see INV, NEG, ABS

Clear high byte only:
SB R0,R0

Initialize VDP reg: yours is 8 words per reg
LI   R15,>8C02    reused frequently
* saves many words of cart space

* then 5 words each:
LI   R0,>F587   text color white/blue
MOVB R0,*R15
SWPB R0
MOVB R0,*R15

Better, Table-driven  

* one word per reg, pre swapped ORI >8000
VREGS DATA >F587,>0281,...
* This loop is just 6 words long
VREGSE EQU $
  LI R0,VREGS
  MOVB *R0+,*R15   where R15 is 8C02
  CI   R0,VREGSE
  JL   $-6          jump back 3 words
  ...
* When comparing addresses, use unsigned JL,JH,JHE,JLE
* not signed JLT, JGT!

Convenient subroutine:

SETVA MOVB R0,*R15
  SWPB R0
  MOVB R0,*R15
  SWPB R0        optional
  RT

Don't forget to
  ORI  R0,>4000   set write mode

Params after call:

  BL   @SETVAD
  DATA >0050    (pre-swapped >1000)
  ...
SETVAD MOVB *R11+,*R15  with R15= 8C02
  NOP                    probably unnecessary
  MOVB *R11+,*R15
  RT
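
For anyone who wants to double-check the pre-swapped words by hand, here is a minimal Python sketch. It is only a model of the byte order, not 9900 code, and the function names are made up for illustration. It assumes the usual convention that the VDP address goes out low byte first, then the high byte with the >40 write bit, and that a register write is the value byte followed by >80 plus the register number.

```python
def swap_bytes(word):
    """Swap the two bytes of a 16-bit word, like SWPB does."""
    return ((word << 8) | (word >> 8)) & 0xFFFF


def vdp_write_addr_word(addr):
    """DATA word for a routine like SETVAD: low byte first, then high byte with the >4000 write bit."""
    return swap_bytes((addr & 0x3FFF) | 0x4000)


def vdp_reg_word(reg, value):
    """Table word for a VDP register write: value byte first, then >80 + register number."""
    return ((value & 0xFF) << 8) | 0x80 | (reg & 0x07)


if __name__ == "__main__":
    print(f">{vdp_write_addr_word(0x1000):04X}")   # word to place after BL @SETVAD
    print(f">{vdp_reg_word(7, 0xF5):04X}")         # text colour entry for the register table
```

Running it reproduces the >0050 and >F587 words used above, which is a cheap sanity check before burning them into a DATA table.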

Can do similar for almost table driven writes:
  BL   @VMTBL
  DATA >0050  char table to >1000 (pre-swapped, write bit set)
  DATA CHARA1
  DATA CHARA$-CHARA1   length 
  ...

 VMTBL uses *R11+ for address as above,
 
  MOVB *R11+,*R15   E/A says to put NOP after but prolly not needed
  NOP
  MOVB *R11+,*R15   
  MOV  *R11+,R1     from addr
  MOV  *R11+,R2     length
VMTBL1
  MOVB *R1+,@>8C00  to VDPWD
  DEC  R2
  JNE  VMTBL1   
  RT

If you want really crazy:
OPMOVS EQU >D830   I think
  BL   @VMTBLX
  DATA >0050
  DATA Rn+OPMOVS MOVB *Rn+,@
  DATA your length
  ...

VMTBLX
* code same as VMTBL up to loop:
 ...
VMTBL2
  X   R1       MOVB your pointer
  DATA >8C00  to VDPWD (X consumes next word as address operand)
  DEC  R2
  JNE  VMTBL2   
  RT
* Using Xecute, makes this a template function.
* Nobody uses this though!
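
To take the "I think" out of the OPMOVS value, the instruction word can be assembled by hand. The Python sketch below is only an illustration (the function name is hypothetical) and assumes the standard 9900 two-operand layout: a 4-bit opcode/byte field, then Td, D, Ts, S.

```python
MOVB_OPCODE = 0xD  # top four bits of a MOVB instruction word (MOV with the byte bit set)


def movb_autoinc_to_symbolic(n):
    """Instruction word for MOVB *Rn+,@addr (the address word follows separately)."""
    td, d = 0b10, 0      # destination: symbolic addressing, no index register
    ts, s = 0b11, n      # source: workspace register indirect with auto-increment
    return (MOVB_OPCODE << 12) | (td << 10) | (d << 6) | (ts << 4) | s


if __name__ == "__main__":
    for n in (0, 1, 3):
        print(f"R{n}: >{movb_autoinc_to_symbolic(n):04X}")
```

With that layout, register 0 comes out as >D830, so Rn+OPMOVS gives the word for any other pointer register.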



Multiply R0 by ten. Maybe faster than MPY?
A   R0,R0
MOV R0,R1
SLA R0,2
A   R1,R0
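
A quick way to convince yourself the four instructions really give 10x is to model them with 16-bit wraparound. This Python sketch only checks the arithmetic (it says nothing about cycle counts versus MPY), and the function name is made up. Note the sequence clobbers R1.

```python
MASK = 0xFFFF  # registers are 16 bits wide


def times_ten(r0):
    """Model of the four-instruction multiply-by-ten sequence."""
    r0 = (r0 + r0) & MASK    # A   R0,R0   -> 2x
    r1 = r0                  # MOV R0,R1   -> keep 2x
    r0 = (r0 << 2) & MASK    # SLA R0,2    -> 8x
    r0 = (r0 + r1) & MASK    # A   R1,R0   -> 8x + 2x
    return r0


if __name__ == "__main__":
    print(times_ten(123))
```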


* This loop is just 6 words long
* This loop is just 6 words long
* This loop is just 6 words long
* This loop is just 6 words long
* da da, da da...

 

 

 


41 minutes ago, FarmerPotato said:
Some optimizations for 9900:



 

 

 

Thanks! Can't believe we missed some of these. I might stay away from the execute based one though....


18 hours ago, Krool885 said:

I might stay away from the execute based one though....

Yeah,  on one Zoom call, (pandemic club) no one was sure how X really behaved on 2 word instructions. I was drawn to its use as a way to tell a VMBW-type routine that my pointer is in a different register, instead of always conventional R1.


 

AFAIK, X only appears in TI engineer source code. There's 3 of them in TI Forth, but @Lee Stewart had to disassemble the CRU words to notice it.  It's used in a brilliant way in TI's Microprocessor Pascal runtime.


I would consider using a buffer in 32K RAM that you 'upload' as fast as possible to the VDP after drawing each frame. It takes some time to do that, but it has many advantages:

  • The buffer can be formatted as a linear bitmap (instead of being character based), which supports faster drawing algorithms.
  • [Edit] You can draw new lines without deleting old lines using OR (SOC) instructions. Much faster than continuously reading and writing to VDP RAM.   
  • You can clear the buffer quickly using word instructions.
  • You will avoid the flicker when the VDP RAM is cleared and then redrawn.
  • If you need to read from the buffer, it's much faster than reading from VDP RAM.

You probably don't need a buffer for the full screen, perhaps 2/3 (4K) or 1/2 (3k) is enough for your game. This is assuming you're keeping to one color. A buffer that size can be uploaded within a couple of VDP frames, giving you a base frame rate of about 10-15 FPS. After taking that performance hit, using a buffer will only be a benefit.
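
As a rough illustration of the linear bitmap idea, here is a minimal Python sketch of a one-colour buffer covering half the screen. The names and the 256x96 size are assumptions for the example; the upload step, which would still have to map this layout onto the VDP's pattern table, is not shown.

```python
WIDTH, HEIGHT = 256, 96              # assumed: half of the 256x192 screen, one colour
BYTES_PER_ROW = WIDTH // 8           # 32 bytes per pixel row in a linear layout

buffer = bytearray(BYTES_PER_ROW * HEIGHT)


def plot(x, y):
    """Set one pixel by ORing its bit into the buffer (no read from VDP RAM needed)."""
    buffer[y * BYTES_PER_ROW + (x >> 3)] |= 0x80 >> (x & 7)


def clear():
    """Wipe the whole buffer; on the 9900 this would be a loop of word-sized clears."""
    buffer[:] = bytes(len(buffer))


if __name__ == "__main__":
    plot(0, 0)
    plot(9, 1)
    print(len(buffer), hex(buffer[0]), hex(buffer[33]))
```

The address calculation is one multiply-by-32 (a shift) and an add, and the buffer size works out to 3072 bytes, matching the 3K figure for half a screen.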

 

You could potentially also use an algorithm for only uploading 'dirty' rectangles to the VDP, but that also comes with an overhead, so the algorithm needs to be very efficient to be a benefit.

 

I don't know if an Elite type game will be fast enough on the TI? Perhaps if you stick to relatively few objects and pre-calculate everything you can, like rotation coordinates? For the latter, using a big ROM cart would be an advantage.

 

Edited by Asmusr

Unrolling VDP loops:

 

You can optimize speed by inlining the loop every time:


  BL  @SETVA   * setup >8C02
T1
  MOVB *R1+,@>8C00
  DEC  R2
  JNE  T1

 

And use any registers you like. 
 

 

 

If you're filling chunks of VDP, the fastest instruction is LI (thanks @Tursi):


 

 BL @SETVA
T2
  LWPI >8C00    * R0 now aliases the VDPWD
  LI   R0,>FF00   * maybe CLR R0
* unfortunately can't use a register as loop counter so we unroll 8 writes
  LI   R0,>FF00
  LI   R0,>FF00
  LI   R0,>FF00
  LI   R0,>FF00
  LI   R0,>FF00
  LI   R0,>FF00
  LI   R0,>FF00
  LWPI >8300 * your WS
  DEC R2        * your loop counter (8 bytes per loop)
  JNE T2

 

Instructions like MOVB, AB, SOCB cause extra memory cycles.  But LI will not.
 

MOVB reads 16 bits of the destination first, to preserve the low 8 bits. Even  MOV does this on the 9900. ("It was to the designer's distinct advantage to do it this way" -- TI patent disclosure). 
 

I think CLR also avoids read-before-write. 

 


27 minutes ago, Asmusr said:

The buffer can be formatted as a linear bitmap (instead of being character based), which supports faster drawing algorithms.

Rasmus might also have a Screen Image Table layout recommendation. 
 

TI's example code for bitmap mode fills each third of the screen image table with the sequence 0-255. Then, calculating a dot address is ...yuck. The first 32 "patterns" correspond to a strip 256 pixels wide and 8 pixels high, with each 8x8 pixel block defined top to bottom. The E/A manual gives 8 lines of assembly to translate an X,Y pixel address to a VDP address and bit offset.
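
For reference, the calculation for that default layout looks like this as a Python sketch. It is a model only, not the E/A manual's routine; the function name is made up, and it assumes the pattern table starts at offset 0 and the image table holds 0-255 in each third.

```python
def pixel_address(x, y):
    """Return (byte offset into the pattern table, bit mask) for pixel x,y in the default layout."""
    offset = ((y >> 3) << 8)     # 32 characters * 8 bytes = 256 bytes per character row
    offset += ((x >> 3) << 3)    # 8 bytes per character across the row
    offset += y & 7              # pixel row inside the character
    mask = 0x80 >> (x & 7)       # leftmost pixel is the most significant bit
    return offset, mask


if __name__ == "__main__":
    for point in ((0, 0), (8, 0), (0, 1), (255, 191)):
        offset, mask = pixel_address(*point)
        print(point, f">{offset:04X}", f">{mask:02X}")
```

The three separate shift-and-mask steps are what make the character-based layout awkward for line drawing.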
 

You get an advantage by organizing it in vertical strips, 8 pixels wide, 64 pixels high.  
 

* Fill the screen image table for bitmap mode vertical strips


  BL   @SETVA    * have this set middle third of screen or whatever
  CLR  R1
T3
  MOVB R1,@>8C00   or @VDPWD
  AI   R1,>800     add 8 for next column strip
  JNC  T3          inner loop ends at carry out,  >=256
  AI   R1,>100     prepare next row
  CI   R1,>800     outer loop ends after 8 rows
  JNE  T3
T4  ...
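
To see what the vertical-strip table buys you, here is a small Python model. The names are hypothetical, and it assumes one third of the screen with 8 character rows of 32 columns. It builds the table (pattern number = column * 8 + row) and shows that the byte offset for a pixel collapses to "strip * 64 + y".

```python
def strip_name_table():
    """Screen image table for one third: each column uses 8 consecutive pattern numbers."""
    return [(col * 8 + row) & 0xFF for row in range(8) for col in range(32)]


def strip_pixel_offset(x, y):
    """Byte offset in the pattern table for pixel x,y (y in 0..63) with the strip layout."""
    return ((x >> 3) << 6) + y


if __name__ == "__main__":
    table = strip_name_table()
    print(table[:4], table[32:36])
    print(strip_pixel_offset(8, 10))
```

Moving down one pixel is then just an increment of the address, which suits Bresenham-style line drawing.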



  
  


I also think of a layout where a 16x16 tile is in  32 contiguous bytes of VDP RAM. So, like, a double size "sprite" pattern. This is suitable for a tile-based playing field, like a Zelda-type RPG. 

 

So a 1/3 of screen image table is:

0,2,4,6...62

1,3,5,7...63

64,66,68,70...126

65,67,69,71...127

and so on. 
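
The same table can be generated with a one-line rule, sketched here in Python. The function name is made up, and the layout is assumed to be 8 character rows of 32 columns for one third of the screen.

```python
def tile_name_table():
    """Image table where each 16x16 tile uses 4 consecutive patterns (32 contiguous bytes)."""
    return [[(row // 2) * 64 + col * 2 + (row % 2) for col in range(32)]
            for row in range(8)]


if __name__ == "__main__":
    for row in tile_name_table()[:4]:
        print(row[:4], "...", row[-1])
```

Each tile then gets its patterns in the order top-left, bottom-left, top-right, bottom-right, the same order a double-size sprite pattern uses.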
 

 

You don't need to be limited by TI's default 0..255 indexing. 
 

 


 

 


3 hours ago, FarmerPotato said:

Yeah,  on one Zoom call, (pandemic club) no one was sure how X really behaved on 2 word instructions. I was drawn to its use as a way to tell a VMBW-type routine that my pointer is in a different register, instead of always conventional R1.

 

I either missed that call or was dozing! When you use the X instruction, you should imagine that the referenced instruction is at the same place as X, i.e., any words that are expected to follow the referenced instruction actually must follow X. The program counter is changed relative to the address of X—not that of the referenced instruction. You can think of X as replacing itself with the referenced instruction and proceeding in line.

 

...lee

 


Wow, an ambitious goal and impressive results! Complete 3D transformations, backface culling, projection, Bresenham, all in assembly -- this must be a first! Great work.

 

You seem to know very well what you're doing, but some thoughts at a high level:

  • If the number of vertices increases, it may become more efficient to compute the overall rotation matrix first (R = Rz * Ry * Rx) and then transform the vertices (6 multiplications per vertex to get x and y; 3 more to get z if necessary), instead of applying the sequence of individual rotation matrices (12 multiplications per vertex).
  • It might be more efficient to precompute the normals and transform those to do the backface culling first (with one transformed vertex), before transforming all vertices. It would require lazy vertex transformation and caching though.
  • You may be drawing the shared edges of adjacent visible triangles twice?
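
The first point can be sketched in a few lines of Python. This is floating-point model code with made-up function names; the real thing would be fixed-point 9900 code, and the axis conventions here are an assumption. It builds R = Rz * Ry * Rx once per frame and checks that it gives the same result as rotating about each axis in turn.

```python
import math


def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1, 0, 0], [0, c, -s], [0, s, c]]


def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0, s], [0, 1, 0], [-s, 0, c]]


def rot_z(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]


def mat_mul(a, b):
    """3x3 matrix product a * b."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transform(m, v):
    """Apply a 3x3 matrix to a vertex: 3 multiplications per output coordinate."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def combined_rotation(ax, ay, az):
    """Computed once per object per frame: rotate about X first, then Y, then Z."""
    return mat_mul(rot_z(az), mat_mul(rot_y(ay), rot_x(ax)))


if __name__ == "__main__":
    r = combined_rotation(0.3, 0.5, 0.7)
    print(transform(r, [1.0, 2.0, 3.0]))
```

The matrix product costs a fixed amount per frame, so the saving only shows up once there are enough vertices to amortise it.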

Micro-optimizations that won't make much practical difference:

  • Check for a terminator after small positive numbers: "MOV *R14+,R10, CI R10,>FFFF, JEQ ..." ---> "MOV *R14+,R10, JLT ..."
  • Check for negative two's complement numbers: "CI R0,>8000 , JL ..." ---> "MOV R0,R0, JLT ..." (swapping the branches though)
  • Flags for signed multiply: "CLR R15 ... INC R15 ... CI R15,1 , JEQ ..." ---> "CLR R15 ... INV R15 ... MOV R15,R15 , JLT ..."
  • Flags for signed multiply: you can avoid them altogether by having two (or a few) distinct code paths: "MPY" and "NEG, MPY, NEG"

If the computer can't get it done in real-time, it may still be fun to draw some near real-time landscapes, buildings, masks, teapots,...


On 10/2/2024 at 4:06 PM, Asmusr said:

I would consider using a buffer in 32K RAM that you 'upload' as fast a possible to the VDP after drawing each frame. It takes some time to do that, but it has many advantages:


 

We, um, well... Aren't using the extra 32k.

 

We wanted to develop something that would be as close to a feasible product of 1984-ish as possible and for that reason we're making this for stock console + cartridge.

 

Otherwise absolutely we'd be doing a lot more with that extra ram. We realise that we've shot ourselves in the foot doing it this way but well, didn't feel quite as cool doing it with the extra ram. If we're motivated we might switch some code out to allow for higher frame rates if the extra ram is detected, but that's a long way off.


2 minutes ago, Krool885 said:

We, um, well... Aren't using the extra 32k.

 


Oh, and I forgot to mention: we are intending to use a fairly big ROM. We're hoping to avoid the 512K monster that is Flying Shark, but 64K or so, potentially. Frame rate wise... Yeah, it'll be slow. But Elite was slow at times too, especially on the C64. I think we'll be able to make something that would be playable by 1984 standards (so not very). I mean, look at that one 3D Egyptian-themed game, I forget the name of, from the same time period: a frame every few seconds. But people still enjoyed it. We should have a couple per second if everything pans out.


7 hours ago, Krool885 said:

We, um, well... Aren't using the extra 32k.

 


Something else to consider, which was possible and done in the era: RAM in cartridge space. Classic99 supports this in some configuration which I cannot detail (I think all unused space is RAM by default in C99... might need to ping @Tursi). This was done for MiniMemory and some MBX carts. Since you are developing this as a cart in the first place, why not work cart-space RAM into the equation? Even some Atari 2600 carts like Tunnel Runner include RAM.


* This loop is just 6 words long
* This loop is just 6 words long
* This loop is just 6 words long
* This loop is just 6 words long
* da da, da da...

 

LOL! That's exactly what I was thinking when I saw the first comment ;)

 

Also I completely forgot SB R0,R0. The number of places I could have used that... ;)

 

Re: VDP Wait states - I am at 99.99% confidence that no wait states are needed for register access to the VDP. I've tested this extensively on the ColecoVision as well, where overrun is very possible. The only time you need wait states is when the VDP needs to communicate with VRAM. So when writing the address register, no wait needed. But if you are pushing it to the limit, you do need to consider the VDP sequencing:

 

VDP Read Byte: write address, write address, VDP ACCESSES VRAM, read data, VDP ACCESSES VRAM, read data, etc...
VDP Write Byte: write address, write address, write data, VDP ACCESSES VRAM, write data, VDP ACCESSES VRAM, etc...

Read-only registers should all be safe too. And obviously don't change the registers during a VRAM access period.

There's no magic to it. The reason overrun happens is because the VDP is only allowed to service CPU memory requests every so many cycles. If you send a second request before the first one is processed, there is no queue, it just overwrites the old request.

Of course, on the TI, between the read-before-write and the multiplexer wait states, there are very very few sequences that are fast enough to overrun the VDP anyway. The most common one is reading the data register after setting a read address. If you're in 8-bit memory, then even that is going to be fine.
 

I think all unused space is RAM by default in C99... might need to ping @Tursi


This is correct, but if you're running from cartridge then you'll need a bank switch scheme. Classic99 doesn't support any RAM based cartridge schemes right now except for MiniMemory, which is only 4k each RAM and ROM. (Oh, and the MBX scheme which I think gives you 1k RAM and banked ROM... but it's not well proven). So you might want to roll your own.


On 10/3/2024 at 4:58 AM, Eric Lafortune said:
  • Flags for signed multiply: "CLR R15 ... INC R15 ... CI R15,1 , JEQ ..." ---> "CLR R15 ... INV R15 ... MOV R15,R15 , JLT ..."
  • Flags for signed multiply: you can avoid them altogether by having two (or a few) distinct code paths: "MPY" and "NEG, MPY, NEG"

 

1. I wonder if you can avoid the write-destination and shave off a cycle?

 

  MOV R15,R15
  JLT ...

H8000 DATA >8000
  COC @H8000,R15       test sign bit
  JEQ ...

 

(However, H8000 is probably in slow RAM, so you add wait states.)

 

2. Use ABS to test sign.  It compares the source to zero, before writing the absolute value. Can be faster than MOV R15,R15.

 

  CLR  R4      init flag
  ABS  R15     first compares to zero
  JGT  T2      (never mind zero)
  SETO R4      set flag.

T2 MPY R15,R14

  ABS  R4      test flag
  JEQ  T3
  NEG R15
T3

 

BTW, ABS is an atomic operation used by TI to implement semaphores.  (On 99000, it even locks the memory bus in between read and write.)

 

3. Here's another way:

 

 

* hold my beer (the PABST macro)
  ABS  R15       first compares to zero
  STST R4        save status bits for later

T2 MPY R15,R14

  ANDI R4,>6000  isolate A> and EQ bits, compare to zero
  JNE  T3        got an A> or EQ bit (both is impossible)
  NEG R14        apply sign, assuming only top 16 bits matter
T3

 

4. Checking both inputs for sign:

 

* test signs of R14 and R15

* hold my beer (the PABST macro)
  ABS  R14       first compares to zero
  STST R4        save status bits for later
  ABS  R15       first compares to zero
  STST R5        save status bits for later
  XOR  R4,R5     XOR the flags

T2 MPY R15,R14
  ANDI R5,>4000  isolate A> bit (Arithmetic>) of the XORed flags in R5
  JEQ  T3        jump if neither or both >0 
  NEG R14        apply sign, assuming only top 16 bits matter
T3

 

As @Asmusr says, fewer instructions will almost always be more optimal on the 4A.  There are few cases where 2 swift instructions beat 1 slower one.  (When the mix varies from 14 to 40 cycles.)


If you expect to have a lot of 0s, and you want to avoid the MPY:  after each ABS, use JEQ to get out. 

 

In  the final code, 0s are treated as negative.  This is ok, since NEG 0 is still 0.  Other ways would be more instructions on average.*

 

  ANDI R4,>6000
  JEQ  T3            
  AI   R4,->4000    already costs more than NEG R4
  JNE  T3           an input was 0
  NEG  R4
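
If you want to test a sign-handling scheme before committing it to assembly, a small model helps. The Python sketch below is an illustration with made-up names, not the demo's code. It takes absolute values, does the unsigned 16x16 to 32-bit multiply that MPY provides, and applies the sign with a full 32-bit negate rather than the top-word-only shortcut used above.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def mpy(a, b):
    """Unsigned 16 x 16 -> 32-bit multiply, which is all the 9900's MPY does."""
    return (a & MASK16) * (b & MASK16)


def signed_mpy(a, b):
    """Signed multiply of two 16-bit two's complement words, returning a 32-bit word."""
    negative = False
    if a & 0x8000:               # like ABS: remember the sign, keep the magnitude
        a = (-a) & MASK16
        negative = not negative
    if b & 0x8000:
        b = (-b) & MASK16
        negative = not negative
    product = mpy(a, b)
    if negative:                 # signs differed (the XOR of the two flags)
        product = (-product) & MASK32
    return product


if __name__ == "__main__":
    print(hex(signed_mpy(0xFFFE, 0x0003)))   # -2 * 3 as a 32-bit two's complement word
```

A zero operand may set the negative flag path but still yields zero, the same observation as "NEG 0 is still 0" above.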

 

 

Truth Tables:


 

Spoiler

 


Compare to 0:

     Status
  ?   A> EQ
+---+------+
| N | 0  0 |
| 0 | 0  1 |
| P | 1  0 |
+---+------+

Y = A*B

  A B   Y   ST ST   XOR 
+-----+---+-------+----+
| N N | P | 00 00 | 00 |
| N 0 | 0 | 00 01 | 01 |
| N P | N | 00 10 | 10 |
| 0 N | 0 | 01 00 | 01 |
| 0 0 | 0 | 01 01 | 00 |
| 0 P | 0 | 01 10 | 11 |
| P N | N | 10 00 | 10 |
| P 0 | 0 | 10 01 | 11 |
| P P | P | 10 10 | 00 |
+-----+---+-------+----+

 

