Krool885 Posted Tuesday at 06:02 PM

Hey everyone! It's been a while. You may remember me and my friend Robyn's snake game from back in March, maybe? Well, we've been (not very) hard at work on another project, a 3D game! Our intention is to make a sibling of the 1984 space sim Elite, given the ti99 never received a port, or really anything similar.

Both of us are in full-time education and quite busy/lazy depending on the day, so progress is slow, but we've finally wrapped up the majority of the 3D code, so we thought we'd put together a little demo for you. We will include the source code this time, since all we've done is write common algorithms and the like in tms9900 assembly language, but going forward, when we start making unique game/creative content, we might not. We will see. The demo itself is just a standard cartridge, and works fine in classic99 and on real hardware.

Anyway! If any seasoned tms9900 programmers see this, then please give us tips/optimizations etc. You guys are gonna be the ones playing this at the end of the day, so it's in your best interest to help us improve the frame rate. We're also more than happy to answer technical questions etc.

Hope you find it cool, even if it's not much, and again, we want feedback!

Lily & Robyn

3D.asm 3D.bin
brain Posted Tuesday at 06:36 PM

Could anyone with a working setup and recording/streaming capability share a YouTube link for those of us who are interested but have a setup currently in pieces awaiting fixes?

Jim
+OLD CS1 Posted Tuesday at 06:47 PM

10 minutes ago, brain said: Anyone with a working setup and recording/streaming capability who can share a Youtube link [...]

The hang around 30s or so is my computer, not the demo.

2024-10-01 14-44-37.mp4
brain Posted Tuesday at 06:54 PM

Thank you sir! It does look very "Elite"-ish.

Jim
+FarmerPotato Posted Tuesday at 07:37 PM

Some optimizations for 9900:

       LI   R0,0        also compares equal to 0
*      CLR  R0          same, but does not compare to 0
       LI   R0,-1       "
*      SETO R0          "

To toggle flags, see INV, NEG, ABS.

Clear high byte only:

       SB   R0,R0

Initialize VDP reg: yours is 8 words per reg.

       LI   R15,>8C02   reused frequently
* saves many words of cart space
* then 5 words each:
       LI   R0,>F587    text color white/blue
       MOVB R0,*R15
       SWPB R0
       MOVB R0,*R15

Better, table-driven:

* one word per reg, pre swapped ORI >8000
VREGS  DATA >F587,>0281,...
* This loop is just 6 words long
VREGSE EQU  $
       LI   R0,VREGS
       MOVB *R0+,*R15   where R15 is 8C02
       CI   R0,VREGSE
       JL   $-6         jump back 3 words
       ...
* When comparing addresses, use unsigned JL,JH,JHE,JLE
* not signed JLT, JGT!

Convenient subroutine:

SETVA  MOVB R0,*R15
       SWPB R0
       MOVB R0,*R15
       SWPB R0          optional
       RT

Don't forget to ORI R0,>4000 to set write mode.

Params after call:

       BL   @SETVAD
       DATA >0050       (pre-swapped >1000)
       ...
SETVAD MOVB *R11+,*R15  with R15= 8C02
       NOP              probably unnecessary
       MOVB *R11+,*R15
       RT

Can do similar for almost table-driven writes:

       BL   @VMTBL
       DATA >0050       char table to >1100
       DATA CHARA1
       DATA CHARA$-CHARA1   length
       ...
VMTBL
* uses *R11+ for address as above
       MOVB *R11+,*R15
* E/A says to put NOP after but prolly not needed
       NOP
       MOVB *R11+,*R15
       MOV  *R11+,R1    from addr
       MOV  *R11+,R2    length
VMTBL1 MOVB *R1+,@>8C00 to VDPWD
       DEC  R2
       JNE  VMTBL1
       RT

If you want really crazy:

OPMOVS EQU  >D830       I think
       BL   @VMTBLX
       DATA >0050
       DATA Rn+OPMOVS   MOVB *Rn+,@
       DATA your length
       ...
VMTBLX
* code same as VMTBL up to loop:
       ...
VMTBL2 X    R1          MOVB your pointer
       DATA >8C00       to VDPWD (X consumes next word as address operand)
       DEC  R2
       JNE  VMTBL2
       RT
* Using Xecute, makes this a template function.
* Nobody uses this though!

Multiply R0 by ten. Maybe faster than MPY?

       A    R0,R0
       MOV  R0,R1
       SLA  R0,2
       A    R1,R0

* This loop is just 6 words long
* This loop is just 6 words long
* This loop is just 6 words long
* This loop is just 6 words long
* da da, da da...
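Two of the tips above are easy to sanity-check off the console. The following is only a Python model, not 9900 code: `times_ten` mirrors the four-instruction A/MOV/SLA/A sequence with 16-bit wraparound, and `vreg_word` (a hypothetical helper name) builds one pre-swapped table word in the form the VREGS table appears to use, with the value in the high byte and >80 plus the register number in the low byte.

```python
MASK = 0xFFFF  # TMS9900 registers are 16 bits wide


def times_ten(r0: int) -> int:
    """Model A R0,R0 / MOV R0,R1 / SLA R0,2 / A R1,R0 with 16-bit wraparound."""
    r0 = (r0 + r0) & MASK   # A   R0,R0  -> 2x
    r1 = r0                 # MOV R0,R1  -> keep a copy of 2x
    r0 = (r0 << 2) & MASK   # SLA R0,2   -> 8x
    r0 = (r0 + r1) & MASK   # A   R1,R0  -> 8x + 2x = 10x
    return r0


def vreg_word(reg: int, value: int) -> int:
    """Build one pre-swapped VDP register word (hypothetical helper).

    High byte: the value to write. Low byte: >80 | register number.
    MOVB sends the high byte first, so the value reaches the VDP before
    the register-select byte.
    """
    return ((value & 0xFF) << 8) | 0x80 | (reg & 0x07)


if __name__ == "__main__":
    print(times_ten(1234))
    print(hex(vreg_word(7, 0xF5)))
```

The model also shows the overflow behaviour: any input above 6553 wraps around, exactly as the register would.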
Krool885 Posted Tuesday at 08:20 PM Author

41 minutes ago, FarmerPotato said: Some optimizations for 9900: [...]

Thanks! Can't believe we missed some of these. I might stay away from the execute based one though....
+FarmerPotato Posted Wednesday at 03:04 PM

18 hours ago, Krool885 said: I might stay away from the execute based one though....

Yeah, on one Zoom call (pandemic club), no one was sure how X really behaved on 2-word instructions. I was drawn to its use as a way to tell a VMBW-type routine that my pointer is in a different register, instead of always the conventional R1.

AFAIK, X only appears in TI engineer source code. There are 3 of them in TI Forth, but @Lee Stewart had to disassemble the CRU words to notice it. It's used in a brilliant way in TI's Microprocessor Pascal runtime.
Asmusr Posted Wednesday at 03:06 PM

I would consider using a buffer in 32K RAM that you 'upload' as fast as possible to the VDP after drawing each frame. It takes some time to do that, but it has many advantages:

- The buffer can be formatted as a linear bitmap (instead of being character based), which supports faster drawing algorithms.
- [Edit] You can draw new lines without deleting old lines using OR (SOC) instructions. Much faster than continuously reading and writing to VDP RAM.
- You can clear the buffer quickly using word instructions.
- You will avoid the flicker when the VDP RAM is cleared and then redrawn.
- If you need to read from the buffer, it's much faster than reading from VDP RAM.

You probably don't need a buffer for the full screen; perhaps 2/3 (4K) or 1/2 (3K) is enough for your game. This is assuming you're keeping to one color. A buffer that size can be uploaded within a couple of VDP frames, giving you a base frame rate of about 10-15 FPS. After taking that performance hit, using a buffer will only be a benefit.

You could potentially also use an algorithm for only uploading 'dirty' rectangles to the VDP, but that also comes with an overhead, so the algorithm needs to be very efficient to be a benefit.

I don't know if an Elite type game will be fast enough on the TI? Perhaps if you stick to relatively few objects and pre-calculate everything you can, like rotation coordinates? For the latter, using a big ROM cart would be an advantage.

Edited Wednesday at 03:23 PM by Asmusr
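To make the buffer idea concrete, here is a minimal Python model, not 9900 code. It assumes "linear" means row-major with 32 bytes per 256-pixel row, and a 128-line (4K) buffer for the 2/3-screen case; the names `new_buffer`, `plot` and `clear` are illustrative only. Plotting ORs a single bit into the buffer, the byte-level equivalent of the SOC approach mentioned above, so existing pixels survive. The step that rearranges the buffer into the VDP's character-based layout during upload is not shown.

```python
WIDTH = 256                    # pixels per row on the TMS9918A bitmap screen
HEIGHT = 128                   # assumed: 2/3 of the 192 lines -> 4K buffer
BYTES_PER_ROW = WIDTH // 8     # 32 bytes per row, one bit per pixel


def new_buffer() -> bytearray:
    """Allocate an all-zero one-colour frame buffer in CPU RAM."""
    return bytearray(BYTES_PER_ROW * HEIGHT)


def plot(buf: bytearray, x: int, y: int) -> None:
    """Set pixel (x, y) by OR-ing its bit in; other pixels are untouched."""
    buf[y * BYTES_PER_ROW + (x >> 3)] |= 0x80 >> (x & 7)


def clear(buf: bytearray) -> None:
    """Zero the whole buffer (word-wide CLR instructions on the real machine)."""
    buf[:] = bytes(len(buf))


if __name__ == "__main__":
    frame = new_buffer()
    plot(frame, 0, 0)
    plot(frame, 7, 0)
    print(len(frame), hex(frame[0]))
```

With this layout the byte address is a shift and an add, with no per-character arithmetic in the inner loop of a line drawer.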
+FarmerPotato Posted Wednesday at 03:06 PM

Unrolling VDP loops: You can optimize speed by inlining the loop every time:

       BL   @SETVA
* setup >8C02
T1     MOVB *R1+,@>8C00
       DEC  R2
       JNE  T1

And use any registers you like.

If you're filling chunks of VDP, the fastest instruction is LI (thanks @Tursi):

       BL   @SETVA
T2     LWPI >8C00
* R0 now aliases the VDPWD
       LI   R0,>FF00
* maybe CLR R0
* unfortunately can't use a register as loop counter so we unroll 8 loops
       LI   R0,>FF00
       LI   R0,>FF00
       LI   R0,>FF00
       LI   R0,>FF00
       LI   R0,>FF00
       LI   R0,>FF00
       LI   R0,>FF00
       LWPI >8300
* your WS
       DEC  R2
* your loop counter (8 bytes per loop)
       JNE  T2

Instructions like MOVB, AB, SOCB cause extra memory cycles. But LI will not. MOVB reads 16 bits of the destination first, to preserve the low 8 bits. Even MOV does this on the 9900. ("It was to the designer's distinct advantage to do it this way" -- TI patent disclosure). I think CLR also avoids read-before-write.
+FarmerPotato Posted Wednesday at 03:30 PM

27 minutes ago, Asmusr said: The buffer can be formatted as a linear bitmap (instead of being character based), which supports faster drawing algorithms.

Rasmus might also have a Screen Image Table layout recommendation.

TI's example code for bitmap mode fills each 1/3 of the screen with the sequence 0-255. Then, calculating a dot address is ...yuck. The first 32 "patterns" correspond to a 256-pixel wide and 8-pixel high strip, with each 8x8 pixel block being defined top to bottom. The E/A manual gives 8 lines of assembly to translate an X,Y pixel address to a VDP address and bit offset.

You get an advantage by organizing it in vertical strips, 8 pixels wide, 64 pixels high.

* Fill the screen image table for bitmap mode vertical strips
       BL   @SETVA      have this set middle third of screen or whatever
       CLR  R1
T3     MOVB R1,@>8C00   or @VDPWD
       AI   R1,>800     add 8 for next column strip
       JNC  T3          inner loop ends at carry out, >=256
       AI   R1,>100     prepare next row
       CI   R1,>800     outer loop ends after 8 rows
       JNE  T3
T4     ...

I also think of a layout where a 16x16 tile is in 32 contiguous bytes of VDP RAM. So, like, a double size "sprite" pattern. This is suitable for a tile-based playing field, like a Zelda-type RPG. So a 1/3 of screen image table is:

0,2,4,6...62
1,3,5,7...63
64,66,68,70...126
65,67,69,71...127

and so on. You don't need to be limited by TI's default 0..255 indexing.
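The payoff of the vertical-strip layout shows up in the address arithmetic. The sketch below is a Python model under stated assumptions: one third of the screen (64 lines), pattern name c*8 + r at character row r and column c, and offsets measured from the start of that third's pattern table. The function names are illustrative and do not come from the E/A manual or the demo source.

```python
def standard_offset(x: int, y: int) -> int:
    """Byte offset with the default 0..255 name sequence (row-of-characters layout)."""
    return ((y >> 3) << 8) + (x & 0xF8) + (y & 7)


def strip_name_table() -> list:
    """Names for one third laid out as vertical strips: row r, column c -> c*8 + r."""
    return [[c * 8 + r for c in range(32)] for r in range(8)]


def strip_offset(x: int, y: int) -> int:
    """Byte offset inside one third (y in 0..63) with the vertical-strip layout."""
    return ((x >> 3) << 6) + y


if __name__ == "__main__":
    names = strip_name_table()
    print(names[0][:4], names[1][:4])
    print(standard_offset(9, 10), strip_offset(9, 10))
```

With strips, stepping one pixel down is just "add 1" to the byte address for all 64 lines of the third, which suits a Bresenham loop; the default layout needs the three-term expression in `standard_offset`.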
+Lee Stewart Posted Wednesday at 07:31 PM

3 hours ago, FarmerPotato said: Yeah, on one Zoom call, (pandemic club) no one was sure how X really behaved on 2 word instructions. I was drawn to its use as a way to tell a VMBW-type routine that my pointer is in a different register, instead of always conventional R1.

I either missed that call or was dozing!

When you use the X instruction, you should imagine that the referenced instruction is at the same place as X, i.e., any words that are expected to follow the referenced instruction actually must follow X. The program counter is changed relative to the address of X, not that of the referenced instruction. You can think of X as replacing itself with the referenced instruction and proceeding in line.

...lee
Eric Lafortune Posted Thursday at 09:58 AM

Wow, an ambitious goal and impressive results! Complete 3D transformations, backface culling, projection, Bresenham, all in assembly -- this must be a first! Great work.

You seem to know very well what you're doing, but some thoughts at a high level:

- If the number of vertices increases, it may become more efficient to compute the overall rotation matrix first (R = Rz * Ry * Rx) and then transform the vertices (6 multiplications per vertex to get x and y; 3 more to get z if necessary), instead of applying the sequence of individual rotation matrices (12 multiplications per vertex).
- It might be more efficient to precompute the normals and transform those to do the backface culling first (with one transformed vertex), before transforming all vertices. It would require lazy vertex transformation and caching though.
- You may be drawing the shared edges of adjacent visible triangles twice?

Micro-optimizations that won't make much practical difference:

- Check for a terminator after small positive numbers: "MOV *R14+,R10, CI R10,>FFFF, JEQ ..." ---> "MOV *R14+,R10, JLT ..."
- Check for negative two's complement numbers: "CI R0,>8000 , JL ..." ---> "MOV R0,R0, JLT ..." (swapping the branches though)
- Flags for signed multiply: "CLR R15 ... INC R15 ... CI R15,1 , JEQ ..." ---> "CLR R15 ... INV R15 ... MOV R15,R15 , JLT ..."
- Flags for signed multiply: you can avoid them altogether by having two (or a few) distinct code paths: "MPY" and "NEG, MPY, NEG"

If the computer can't get it done in real-time, it may still be fun to draw some near real-time landscapes, buildings, masks, teapots,...
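The first suggestion, combining the three rotations into one matrix, can be checked with a small floating-point model. This is a Python sketch, not the demo's fixed-point code; the axis conventions, angle values and function names are assumptions chosen for illustration. It shows that applying Rx, Ry, Rz in sequence gives the same result as one precomputed R = Rz * Ry * Rx, and that screen x and y need only the top two rows of R (6 multiplications).

```python
import math


def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1, 0, 0], [0, c, -s], [0, s, c]]


def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0, s], [0, 1, 0], [-s, 0, c]]


def rot_z(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]


def matmul(a, b):
    """3x3 matrix product a * b."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transform(m, v):
    """Full 3x3 transform of vertex v (9 multiplications)."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def transform_xy(m, v):
    """Only x and y: uses the top two rows of m (6 multiplications)."""
    return [m[0][0] * v[0] + m[0][1] * v[1] + m[0][2] * v[2],
            m[1][0] * v[0] + m[1][1] * v[1] + m[1][2] * v[2]]


if __name__ == "__main__":
    ax, ay, az = 0.3, 0.7, 1.1
    r = matmul(rot_z(az), matmul(rot_y(ay), rot_x(ax)))  # once per frame
    print(transform_xy(r, [10.0, -20.0, 30.0]))
```

The matrix product is paid once per object per frame, so the saving grows with the vertex count.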
Krool885 Posted Thursday at 07:54 PM Author

On 10/2/2024 at 4:06 PM, Asmusr said: I would consider using a buffer in 32K RAM that you 'upload' as fast a possible to the VDP after drawing each frame. [...]

We, um, well... Aren't using the extra 32k. We wanted to develop something that would be as close to a feasible product of 1984-ish as possible and for that reason we're making this for stock console + cartridge. Otherwise absolutely we'd be doing a lot more with that extra ram. We realise that we've shot ourselves in the foot doing it this way but well, didn't feel quite as cool doing it with the extra ram.

If we're motivated we might switch some code out to allow for higher frame rates if the extra ram is detected, but that's a long way off.
Krool885 Posted Thursday at 08:00 PM Author

Oh and I forgot to mention, we are intending on using a fairly big ROM. We're hoping to avoid the 512k monster that is Flying Shark, but 64k or so potentially.

Frame rate wise... Yeah, it'll be slow. But Elite was slow at times too, especially on the C64. I think we'll be able to make something that would be playable by 1984 standards (so not very). I mean, look at that one 3D Egyptian-themed game I forget the name of from the same time period. A frame every few seconds. But people still enjoyed it. We should have a couple per second if everything pans out.
+OLD CS1 Posted Thursday at 08:52 PM

7 hours ago, Krool885 said: We, um, well... Aren't using the extra 32k. [...]

Something else to consider which was possible and done in the era: RAM in cartridge space. Classic99 supports this in some configuration which I cannot detail (I think all unused space is RAM by default in C99... might need to ping @Tursi). This was done for MiniMemory and some MBX carts. Since you are developing this as a cart in the first place, why not work cart-space RAM into the equation? Even some Atari 2600 carts like Tunnel Runner include RAM.
Tursi Posted Friday at 10:11 PM

"* This loop is just 6 words long * This loop is just 6 words long * This loop is just 6 words long * This loop is just 6 words long * da da, da da..."

LOL! That's exactly what I was thinking when I saw the first comment.

Also I completely forgot SB R0,R0. The number of places I could have used that...

Re: VDP wait states - I am at 99.99% confidence that no wait states are needed for register access to the VDP. I've tested this extensively on the ColecoVision as well, where overrun is very possible. The only time you need wait states is when the VDP needs to communicate with VRAM. So when writing the address register, no wait needed. But if you are pushing it to the limit, you do need to consider the VDP sequencing:

VDP Read Byte: write address, write address, VDP ACCESSES VRAM, read data, VDP ACCESSES VRAM, read data, etc...
VDP Write Byte: write address, write address, write data, VDP ACCESSES VRAM, write data, VDP ACCESSES VRAM, etc...

Read-only registers should all be safe too. And obviously don't change the registers during a VRAM access period.

There's no magic to it. The reason overrun happens is because the VDP is only allowed to service CPU memory requests every so many cycles. If you send a second request before the first one is processed, there is no queue, it just overwrites the old request.

Of course, on the TI, between the read-before-write and the multiplexer wait states, there are very very few sequences that are fast enough to overrun the VDP anyway. The most common one is reading the data register after setting a read address. If you're in 8-bit memory, then even that is going to be fine.

"I think all unused space is RAM by default in C99... might need to ping @Tursi"

This is correct, but if you're running from cartridge then you'll need a bank switch scheme. Classic99 doesn't support any RAM based cartridge schemes right now except for MiniMemory, which is only 4k each RAM and ROM. (Oh, and the MBX scheme which I think gives you 1k RAM and banked ROM... but it's not well proven). So you might want to roll your own.
+FarmerPotato Posted Saturday at 12:52 AM

On 10/3/2024 at 4:58 AM, Eric Lafortune said: Flags for signed multiply: "CLR R15 ... INC R15 ... CI R15,1 , JEQ ..." ---> "CLR R15 ... INV R15 ... MOV R15,R15 , JLT ..." Flags for signed multiply: you can avoid them altogether by having two (or a few) distinct code paths: "MPY" and "NEG, MPY, NEG"

1. I wonder if you can avoid the write-destination and shave off a cycle?

       MOV  R15,R15
       JLT  ...

H8000  DATA >8000
       COC  @H8000,R15  test sign bit
       JEQ  ...

(However, H8000 is probably in slow RAM, so you add wait states.)

2. Use ABS to test sign. It compares the source to zero, before writing the absolute value. Can be faster than MOV R15,R15.

       CLR  R4          init flag
       ABS  R15         first compares to zero
       JGT  T2          (never mind zero)
       SETO R4          set flag.
T2     MPY  R15,R14
       ABS  R4          test flag
       JEQ  T3
       NEG  R15
T3

BTW, ABS is an atomic operation used by TI to implement semaphores. (On 99000, it even locks the memory bus in between read and write.)

3. Here's another way:

* hold my beer (the PABST macro)
       ABS  R15         first compares to zero
       STST R4          save status bits for later
T2     MPY  R15,R14
       ANDI R4,>6000    isolate A> and EQ bits, compare to zero
       JNE  T3          got an A> or EQ bit (both is impossible)
       NEG  R14         apply sign, assuming only top 16 bits matter
T3

4. Checking both inputs for sign:

* test signs of R14 and R15
* hold my beer (the PABST macro)
       ABS  R14         first compares to zero
       STST R4          save status bits for later
       ABS  R15         first compares to zero
       STST R5          save status bits for later
       XOR  R5,R4       XOR the flags (result lands in R4)
T2     MPY  R15,R14
       ANDI R4,>4000    isolate A> bit (Arithmetic>)
       JEQ  T3          jump if neither or both >0
       NEG  R14         apply sign, assuming only top 16 bits matter
T3

As @Asmusr says, fewer instructions will almost always be more optimal on the 4A. There are few cases where 2 swift instructions beat 1 slower one. (When the mix varies from 14 to 40 cycles.)

If you expect to have a lot of 0s, and you want to avoid the MPY: after each ABS, use JEQ to get out.

In the final code, 0s are treated as negative. This is ok, since NEG 0 is still 0. Other ways would be more instructions on average.*

       ANDI R4,>6000
       JEQ  T3
       AI   R4,->4000   already costs more than NEG R4
       JNE  T3          an input was 0
       NEG  R4

Truth Tables:

Compare to 0:

  ?   Status
      A> EQ
+---+-------+
| N | 0  0  |
| 0 | 0  1  |
| P | 1  0  |
+---+-------+

Y = A*B

  A B   Y   ST ST   XOR
+-----+---+-------+----+
| N N | P | 00 00 | 00 |
| N 0 | 0 | 00 01 | 01 |
| N P | N | 00 10 | 10 |
| 0 N | 0 | 01 00 | 01 |
| 0 0 | 0 | 01 01 | 00 |
| 0 P | 0 | 01 10 | 11 |
| P N | N | 10 00 | 10 |
| P 0 | 0 | 10 01 | 11 |
| P P | P | 10 10 | 00 |
+-----+---+-------+----+
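The truth table can be exercised with a short Python model of the ABS/STST/XOR idea. This is a sketch, not 9900 code: only the A> (>4000) and EQ (>2000) status bits are modelled, the function names are made up for the example, and the sign is applied to the full 32-bit product rather than only to the high word as in the listings above.

```python
A_GT = 0x4000  # arithmetic-greater-than bit of the status register
EQ = 0x2000    # equal bit of the status register


def to_signed(word: int) -> int:
    """Interpret a 16-bit word as a two's complement value."""
    return word - 0x10000 if word & 0x8000 else word


def abs_with_status(word: int):
    """Model ABS followed by STST: (magnitude, status bits from comparing to 0)."""
    value = to_signed(word & 0xFFFF)
    status = 0
    if value > 0:
        status |= A_GT
    if value == 0:
        status |= EQ
    return abs(value) & 0xFFFF, status


def signed_product(a_word: int, b_word: int) -> int:
    """Unsigned multiply of the magnitudes, sign chosen from XOR of the A> bits."""
    mag_a, st_a = abs_with_status(a_word)
    mag_b, st_b = abs_with_status(b_word)
    product = mag_a * mag_b            # MPY is unsigned, 32-bit result
    if (st_a ^ st_b) & A_GT:           # exactly one input was strictly positive
        product = -product             # a zero input just negates 0
    return product


if __name__ == "__main__":
    print(signed_product(0xFFFD, 7))   # -3 * 7
```

As the table predicts, a zero input is handled like a negative one: the XOR may request a negation, but negating a zero product changes nothing.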