supercat Posted January 2, 2006 Share Posted January 2, 2006 For the Wormy game I'm working on (2600), I'm trying to cut kernel stack usage to the absolute minimum. Two bytes is unacceptable if one will do. Right now, I save a byte by generating a table of all the locations from which the kernel may be called (actually, a table of MSBs and LSBs). To invoke the kernel, I use a macro that generates: BRK byte n where 'n' is zero for the first call, '1' for the second, etc. At the end of the kernel, I perform: ldx #$FF txs ldx $FF [saved index value] lda RET_MSB,x pha lda RET_LSB,x pha rts Each kernel call thus requires four bytes of ROM (one for the BRK, one for the index, and two for the return address). Unfortunately, the bytes are not consecutive. Presently, I'm using a kludged-together post-processor so I can use non-sequential ORG directives, but that seems rather ugly. Is there a better way to do things? (BTW, I'm well aware that the BRK trashes $FD-$FF, and the above code trashes $FE-$FF, but what matters is that the kernel is free to trash $FD and $FE.) Quote Link to comment Share on other sites More sharing options...
Bruce Tomlin Posted January 2, 2006 Share Posted January 2, 2006 (edited) I'm not sure what DASM will let you do, but what I've done on the 7800, where graphics are very non-consecutive, is to use a macro that does a bunch of SEG directives to drop the code into each area, then SEGs back to the code segment. Of course all this is preceeded by a bunch of SEGs and ORGs to initialize each segment's address. Then it's just a matter of having "IF (. > $xxxx) / ERROR message / ENDIF" blocks to check that I haven't gone over the alloted space. EDIT: something like this SEG jmptbl ORG JMPTAB SEG CODE counter = 0 MACRO KERN BRK DB counter counter = counter + 1 temp = . SEG jmptbl DW temp SEG CODE ENDM Edited January 2, 2006 by Bruce Tomlin Quote Link to comment Share on other sites More sharing options...
+batari Posted January 2, 2006 Share Posted January 2, 2006 Each kernel call thus requires four bytes of ROM (one for the BRK, one for the index, and two for the return address). Unfortunately, the bytes are not consecutive. Presently, I'm using a kludged-together post-processor so I can use non-sequential ORG directives, but that seems rather ugly. Is there a better way to do things? 992399[/snapback] I don't see why you need non-sequential ORG's. It seems to be that two tables could just contain .byte <(label-1) and .byte >(label-1) where "label" is placed where you want the code to return to. The index for the BRK isn't even needed since you don't really care what addresses get put into the stack. You could even pass the index in the accumulator and store it in $ff at the beginning of the kernel, then that would save the need to fetch the index from the byte after the BRK every time (is that what you're doing?) e.g.: lda #1; index value brk label1; return here, no index needed ... RET_MSB .byte >(label0-1), >(label1-1), ... RET_LSB .byte <(label0-1), <(label1-1), ... Am I missing something? Quote Link to comment Share on other sites More sharing options...
djmips Posted January 2, 2006 Share Posted January 2, 2006 lda #1; index value brklabel1; return here, no index needed That is still one more byte than supercat's method. And it requires an extra sta at the beginning of the kernel. I think if you use a macro for your calling method and have a MAX_INSTANCES counter in the macro, that counts up everytime a new return address is nececesary. In the same macro, I'd probably try something like generating a symbol iteratively like RETURN_X where X is the n'th instance. Then at table generation time, repeat while counting MAX_INSTANCES and output the RETURN_X symbols in your tables. I haven't done this in DASM but I have done something similar in other assemblers. Quote Link to comment Share on other sites More sharing options...
+batari Posted January 2, 2006 Share Posted January 2, 2006 (edited) lda #1; index value brk label1; return here, no index needed That is still one more byte than supercat's method. And it requires an extra sta at the beginning of the kernel. True, but Supercat's method would require a lengthly bit of code to fetch the index byte after the brk and store it in $FF. Probably 20 bytes of overhead, so you'd need to have 20 return addresses to break even. EDIT: Thinking more, maybe not quite 20, maybe 6: EDIT^2: Correcting myself again. Needs 8 bytes minimum to fetch the index byte: ldy #$FF dec $FF lda ($FE),y sta $FF Regardless, the issue is RAM, not ROM, and I admit that I still don't understand why non-consecutive ORG's are needed... Can someone enlighten me? Edited January 2, 2006 by batari Quote Link to comment Share on other sites More sharing options...
supercat Posted January 2, 2006 Author Share Posted January 2, 2006 EDIT^2: Correcting myself again. Needs 8 bytes minimum to fetch the index byte: ldy #$FF dec $FF lda ($FE),y sta $FF Regardless, the issue is RAM, not ROM, and I admit that I still don't understand why non-consecutive ORG's are needed... Can someone enlighten me? 992471[/snapback] At the moment I'm using LDA/JMP, but I thought the address pushed by the BRK instruction was the address of the byte following (which would mean the return would go to the byte after that). So I had been planning on: LDY #0 LDA ($FE),Y STA $FF which is four bytes more than 'just' an STA but it saves four bytes for each kernel call. Probably not a big deal either way. Is it possible to use multiple initialized segments and switch between in -f3 mode? I didn't think that could be done, so I used -f2 mode and threw together a post-processor. BTW, next goal is a 4-digit-plus-zero score kernel using seven bytes of scratchpad (and not using Venetian blinds if I can avoid it). Quote Link to comment Share on other sites More sharing options...
+batari Posted January 2, 2006 Share Posted January 2, 2006 At the moment I'm using LDA/JMP, but I thought the address pushed by the BRK instruction was the address of the byte following (which would mean the return would go to the byte after that). I thought so too initially, until I ran some test code through the Stella debugger: ORG $f000 start cld ldx #$ff txs brk .byte 1 break ldy #0 lda ($FE),y sta $ff org $fffc .word start .word break This code stored A0 instead of 1 in FF, so I edited my above post. Is it possible to use multiple initialized segments and switch between in -f3 mode? I didn't think that could be done, so I used -f2 mode and threw together a post-processor. 992551[/snapback] It's not possible. I think the -f2 segments can even overlap without generating errors, and this would be impossible to handle in -f3 mode. A postprocessor is not a big deal if it does what you need it to do. Quote Link to comment Share on other sites More sharing options...
supercat Posted January 2, 2006 Author Share Posted January 2, 2006 A postprocessor is not a big deal if it does what you need it to do. 992572[/snapback] True, but it makes the macros less useful for other people unless they want to use my tools. BTW, I've figured out how to shrink the score kernel's scratchpad usage to the point of being a non-issue. I don't think I'll bother with the full optimization since other parts of the display kernel would still need more scratchpad, but assuming score is kept in two byte BCD, a four-digit-plus-trailing-zero score kernel could be done only trashing two bytes of RAM (actually, it could be done trashing none, but the setup/cleanup would be annoying and I don't really see much point except as a challenge. CHALLENGE Assuming a player's score is kept in a two-byte BCD variable. Write a mini-kernel that will display the score, with a trailing zero, using five "close" eight-pixel-wide white sprites on a black background without trashing any RAM. All hardware registers are fair game provided they do not disrupt the display, and it's okay to trash RAM if it's restored afterward. Can anyone see how to do it? Some hints: -1- Unrolling the loop can help in a couple ways. -2- Some places can hold a few bits for safe-keeping but not whole bytes. -3- Will the score be turned into something else? Can it be reconstituted afterwards? -4- The TIA and RIOT may both be helpful. -5- Who needs a stack pointer? -6- I think the kernel itself will to use eight bytes of scratchpad (four pointers) and the A and Y registers, so 64 bits have to be stashed someplace; stash 8 bits each in X and S and that leaves 48. Can you do it? Quote Link to comment Share on other sites More sharing options...
djmips Posted January 2, 2006 Share Posted January 2, 2006 1. Unrolling usually helps because you won't need to keep a counter to iterate over the number of loops. Things that are variable can be fixed and not require ram. 2. Eep, you are free to use any hardware registers that allow read/write to store information as long as those registers are not required during your routine. Some candidates in the RIOT might be the Data Direction Registers (2 x 8 bits)... Seem to recall a warning about setting port B to output however. 3. Convert the score to pointers use the pointers during display and then convert the pointers back to the score.. 4. The TIA will be used for display and you state you are using 2 bytes of the 128 byte RIOT RAM. I keed. Actually I don't have an answer for 4. 5. Stack pointer can make a good fast temporary, but make sure you disable interrupts first. 6. 48 bits, well, 16 to the DDR as stated in 2. which leaves 32... Oh no, you're not planning on writing values into the timer and then reading them back and fixing up the value based on a fixed # of cycles that have passed... oh yes you are. . So that leaves 24, You are going to reuse the two byte BCD so that leaves 8 bits which are probably apportioned to other data, like probing the playfield with the ball and reading back stored data in the collision bits or some other hair-brained scheme like ask the user to hold or release the fire button (that's one bit right there!) ... but it's late.... Quote Link to comment Share on other sites More sharing options...
djmips Posted January 2, 2006 Share Posted January 2, 2006 Other crazy places to stash bits. the P register (ie flags), for a lot of code, you could be safe with the interrupt bit, the overflow bit (avoid carries from bit 6 to 7 in ADC and SBC and don't use BIT) depending on your code you might be able to use the carry but that's a lot tougher to avoid. Quote Link to comment Share on other sites More sharing options...
supercat Posted January 2, 2006 Author Share Posted January 2, 2006 2. Eep, you are free to use any hardware registers that allow read/write to store information as long as those registers are not required during your routine. Some candidates in the RIOT might be the Data Direction Registers (2 x 8 bits)... Seem to recall a warning about setting port B to output however. DDR's are too dangerous, IMHO, though I suppose selected bits of SWBCTRL might be okay since not all pins have stuff hooked up to them. 4. The TIA will be used for display and you state you are using 2 bytes of the 128 byte RIOT RAM. I keed. Actually I don't have an answer for 4. You could stash 20 bits in PF0-PF2, and 7 bits each in the position registers for M0, M1, and the Ball. Reading those bits out would be annoyingly slow (set all color registers to black after showing the digits, then watch the collision registers while using Player 0 and Player 1 to "probe" the other objects and playfield) 6. 48 bits, well, 16 to the DDR as stated in 2. which leaves 32... Oh no, you're not planning on writing values into the timer and then reading them back and fixing up the value based on a fixed # of cycles that have passed... oh yes you are. Or store the value in T1024T and read it back within 1024 cycles. So that leaves 24, You are going to reuse the two byte BCD so that leaves 8 bits which are probably apportioned to other data, like probing the playfield with the ball and reading back stored data in the collision bits or some other hair-brained scheme like ask the user to hold or release the fire button (that's one bit right there!) ... but it's late.... 992619[/snapback] Twelve bits in the pointer high-bytes (three bits each). Seven bits in the Y register (put the digit shape data in the second half of a page, and then adjust the pointers according to whatever value happens to be stored in Y). Other possibilities include the top 3 bits of PC, ENABL, CTRLPF bits 0 (reflection control) and 4-5 (Ball size). Some collision registers might also be useful. Quote Link to comment Share on other sites More sharing options...
Bruce Tomlin Posted January 6, 2006 Share Posted January 6, 2006 Each kernel call thus requires four bytes of ROM (one for the BRK, one for the index, and two for the return address). Unfortunately, the bytes are not consecutive. Presently, I'm using a kludged-together post-processor so I can use non-sequential ORG directives, but that seems rather ugly. Is there a better way to do things?I don't see why you need non-sequential ORG's. It seems to be that two tables could just contain .byte <(label-1) and .byte >(label-1) where "label" is placed where you want the code to return to. Am I missing something? Yes. You're missing that you have to keep track of the return table yourself. If you change your source code, you have to update it manually. My method with macros maintains the index and table automatically. Need another call to the kernal? Just drop another macro invocation. No muss, no fuss. Or store the value in T1024T and read it back within 1024 cycles.I seem to recall that the RIOT timers decrement the value immediately upon load, then take however many cycles between counts until they reach FF, at which point they decrement on every clock. If that's true, then T1024T would be a great place to store a loop counter. Quote Link to comment Share on other sites More sharing options...
supercat Posted January 6, 2006 Author Share Posted January 6, 2006 If that's true, then T1024T would be a great place to store a loop counter. 995005[/snapback] I'll have to check to see exactly how it behaves, but the most important thing is that it's predictable. BTW, in some games where there isn't a "natural" scan line counter (e.g. because everything on-screen can move vertically) INTIM can be excellent for that purpose. Just load it so it will "time out" after the proper number of scan lines (when possible, I use the 128->127 transistion). Checking it will then be a cycle cheaper than decrementing a memory location (and won't waste a byte of RAM). Quote Link to comment Share on other sites More sharing options...
supercat Posted January 6, 2006 Author Share Posted January 6, 2006 Yes. You're missing that you have to keep track of the return table yourself. If you change your source code, you have to update it manually. My method with macros maintains the index and table automatically. Need another call to the kernal? Just drop another macro invocation. No muss, no fuss.995005[/snapback] I decided to give another go to automating the label generation, and it actually is possible to do things automatically without requiring a preprocessor and without having to manually generate labels. Ugly is an understatement, however. A program with nothing except 30 kernel calls generates 120 bytes of binary data (as it should) and more than 5,800 lines of listing. Perhaps I need to figure out how to use listing controls within macros. Here's the code. Any suggestions for improvements? PROCESSOR 6502 mac qz_labeln ; qz_labeln {action},{label},{underscores} if {3} >= 5 if {3} >= 10 qz_labeln {1},{2}x,{3}-10 else qz_labeln {1},{2}v,{3}-5 endif else if {3} > 0 qz_labeln {1},{2}i,{3}-1 else {1} {2} endif endif endm mac qz_makelabel {1}: endm mac qz_makelsb_minus1 byte <(({1})-1) endm mac qz_makemsb_minus1 byte >(({1})-1) endm qz_kernelindex .set 0 mac DO_KERNEL brk byte qz_kernelindex qz_kernelindex .set qz_kernelindex+1 qz_labeln qz_makelabel,QZ_,qz_kernelindex endm mac QZ_MAKE_LSBS qz_temp .set 0 repeat qz_kernelindex qz_temp .set qz_temp+1 qz_labeln qz_makelsb_minus1,QZ_,qz_temp repend endm mac QZ_MAKE_MSBS qz_temp .set 0 repeat qz_kernelindex qz_temp .set qz_temp+1 qz_labeln qz_makemsb_minus1,QZ_,qz_temp repend endm seg code org $1000 DO_KERNEL DO_KERNEL DO_KERNEL DO_KERNEL DO_KERNEL DO_KERNEL DO_KERNEL DO_KERNEL DO_KERNEL DO_KERNEL DO_KERNEL DO_KERNEL DO_KERNEL DO_KERNEL DO_KERNEL DO_KERNEL DO_KERNEL DO_KERNEL DO_KERNEL DO_KERNEL DO_KERNEL DO_KERNEL DO_KERNEL DO_KERNEL DO_KERNEL DO_KERNEL DO_KERNEL DO_KERNEL DO_KERNEL DO_KERNEL QZ_LSBS: QZ_MAKE_LSBS QZ_MSBS: QZ_MAKE_MSBS echo * .END Note that the QZ_MAKE_LSBS and QZ_MAKE_MSBS macros must be invoked after the last invocation of DO_KERNEL, but provided that is done the assembler will generate the necessary code without a preprocessor. I'll have to see if I can clean up the listing file, though, since I use listings for reference a lot. Quote Link to comment Share on other sites More sharing options...
+batari Posted January 6, 2006 Share Posted January 6, 2006 Am I missing something?Yes. You're missing that you have to keep track of the return table yourself. If you change your source code, you have to update it manually. My method with macros maintains the index and table automatically. Need another call to the kernal? Just drop another macro invocation. No muss, no fuss. 995005[/snapback] I see, and yes, you're right. I was guessing that maintaining the table wouldn't be that difficult unless there are a large number of return addresses. Exactly how big is the table? Big macros and postprocessors seem like overkill unless there are >20 entries or so. Quote Link to comment Share on other sites More sharing options...
Bruce Tomlin Posted January 7, 2006 Share Posted January 7, 2006 It occurs to me that if the assembler macros support looping, and you don't mind having to use a "finish it off" macro, the return labels table could be dumped in a loop. So each kernel call macro would generate "RETURNn" where 'n' was the label number, then at the end it would do a table full of DW (or two tables full of DB) to RETURN1, RETURN2, RETURN3, etc. You'd have to be realy tricky about it if you want to be able to use your kernel call macro _after_ the kernel return macro, though. Quote Link to comment Share on other sites More sharing options...
djmips Posted January 25, 2006 Share Posted January 25, 2006 Are you going to reveal your take on the 'CHALLENGE' John? - David Quote Link to comment Share on other sites More sharing options...
supercat Posted January 25, 2006 Author Share Posted January 25, 2006 Are you going to reveal your take on the 'CHALLENGE' John?1006757[/snapback] Well, I think I've described in this thread the spots that I would use for saving bits here and there. Some more practical than others, to be sure. Trying to get the RAM usage for the score routine down to zero is purely academic; it's interesting that it can be done, but other parts of the kernel will have non-zero requirements so it's not worth eking out every single byte of RAM (my guess is that kernel usage elsewhere will probably end up being eight bytes, with the "return" held in the stack pointer). If that's the case, unrolling the loop would make the digit kernel easy; if I don't want to unroll the loop I'll have to save a byte which I can shove into the MSB's of the pointer addresses. Quote Link to comment Share on other sites More sharing options...
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.