Jump to content
IGNORED

Surviving on minimal stack using DASM--tricks?


supercat

Recommended Posts

For the Wormy game I'm working on (2600), I'm trying to cut kernel stack usage to the absolute minimum. Two bytes is unacceptable if one will do. Right now, I save a byte by generating a table of all the locations from which the kernel may be called (actually, a table of MSBs and LSBs). To invoke the kernel, I use a macro that generates:

 BRK
 byte n

where 'n' is zero for the first call, '1' for the second, etc. At the end of the kernel, I perform:

 ldx #$FF
 txs
 ldx $FF [saved index value]
 lda RET_MSB,x
 pha
 lda RET_LSB,x
 pha
 rts

Each kernel call thus requires four bytes of ROM (one for the BRK, one for the index, and two for the return address). Unfortunately, the bytes are not consecutive. Presently, I'm using a kludged-together post-processor so I can use non-sequential ORG directives, but that seems rather ugly. Is there a better way to do things?

 

(BTW, I'm well aware that the BRK trashes $FD-$FF, and the above code trashes $FE-$FF, but what matters is that the kernel is free to trash $FD and $FE.)

Link to comment
Share on other sites

I'm not sure what DASM will let you do, but what I've done on the 7800, where graphics are very non-consecutive, is to use a macro that does a bunch of SEG directives to drop the code into each area, then SEGs back to the code segment. Of course all this is preceeded by a bunch of SEGs and ORGs to initialize each segment's address.

 

Then it's just a matter of having "IF (. > $xxxx) / ERROR message / ENDIF" blocks to check that I haven't gone over the alloted space.

 

EDIT: something like this

      SEG jmptbl
     ORG JMPTAB
     SEG CODE
counter = 0

     MACRO  KERN
     BRK
     DB counter
counter = counter + 1
temp = .
     SEG jmptbl
     DW temp
     SEG CODE
     ENDM

Edited by Bruce Tomlin
Link to comment
Share on other sites

Each kernel call thus requires four bytes of ROM (one for the BRK, one for the index, and two for the return address).  Unfortunately, the bytes are not consecutive.  Presently, I'm using a kludged-together post-processor so I can use non-sequential ORG directives, but that seems rather ugly.  Is there a better way to do things?

992399[/snapback]

I don't see why you need non-sequential ORG's. It seems to be that two tables could just contain .byte <(label-1) and .byte >(label-1) where "label" is placed where you want the code to return to.

 

The index for the BRK isn't even needed since you don't really care what addresses get put into the stack. You could even pass the index in the accumulator and store it in $ff at the beginning of the kernel, then that would save the need to fetch the index from the byte after the BRK every time (is that what you're doing?)

 

e.g.:

 

 lda #1; index value
 brk
label1; return here, no index needed


...
RET_MSB
 .byte >(label0-1), >(label1-1), ...


RET_LSB
 .byte <(label0-1), <(label1-1), ...

Am I missing something?

Link to comment
Share on other sites

  lda #1; index value

  brk

label1; return here, no index needed

 

That is still one more byte than supercat's method. And it requires an extra sta at the beginning of the kernel.

 

I think if you use a macro for your calling method and have a MAX_INSTANCES counter in the macro, that counts up everytime a new return address is nececesary.

 

In the same macro, I'd probably try something like generating a symbol iteratively like RETURN_X where X is the n'th instance. Then at table generation time, repeat while counting MAX_INSTANCES and output the RETURN_X symbols in your tables.

 

I haven't done this in DASM but I have done something similar in other assemblers.

Link to comment
Share on other sites

 lda #1; index value
 brk
label1; return here, no index needed

 

That is still one more byte than supercat's method. And it requires an extra sta at the beginning of the kernel.

True, but Supercat's method would require a lengthly bit of code to fetch the index byte after the brk and store it in $FF. Probably 20 bytes of overhead, so you'd need to have 20 return addresses to break even.

 

EDIT: Thinking more, maybe not quite 20, maybe 6:

EDIT^2: Correcting myself again. Needs 8 bytes minimum to fetch the index byte:

 ldy #$FF
 dec $FF
 lda ($FE),y
 sta $FF

 

Regardless, the issue is RAM, not ROM, and I admit that I still don't understand why non-consecutive ORG's are needed... Can someone enlighten me?

Edited by batari
Link to comment
Share on other sites

EDIT^2: Correcting myself again.  Needs 8 bytes minimum to fetch the index byte:

 ldy #$FF
 dec $FF
 lda ($FE),y
 sta $FF

 

Regardless, the issue is RAM, not ROM, and I admit that I still don't understand why non-consecutive ORG's are needed... Can someone enlighten me?

992471[/snapback]

 

At the moment I'm using LDA/JMP, but I thought the address pushed by the BRK instruction was the address of the byte following (which would mean the return would go to the byte after that). So I had been planning on:

 LDY #0
 LDA ($FE),Y
 STA $FF

which is four bytes more than 'just' an STA but it saves four bytes for each kernel call. Probably not a big deal either way.

 

Is it possible to use multiple initialized segments and switch between in -f3 mode? I didn't think that could be done, so I used -f2 mode and threw together a post-processor.

 

BTW, next goal is a 4-digit-plus-zero score kernel using seven bytes of scratchpad (and not using Venetian blinds if I can avoid it).

Link to comment
Share on other sites

At the moment I'm using LDA/JMP, but I thought the address pushed by the BRK instruction was the address of the byte following (which would mean the return would go to the byte after that).

I thought so too initially, until I ran some test code through the Stella debugger:

 ORG $f000
start
 cld
 ldx #$ff
 txs
 brk
 .byte 1
break
 ldy #0
 lda ($FE),y
 sta $ff

org $fffc
.word start
.word break

This code stored A0 instead of 1 in FF, so I edited my above post.

 

Is it possible to use multiple initialized segments and switch between in -f3 mode?  I didn't think that could be done, so I used -f2 mode and threw together a post-processor.

992551[/snapback]

It's not possible. I think the -f2 segments can even overlap without generating errors, and this would be impossible to handle in -f3 mode.

 

A postprocessor is not a big deal if it does what you need it to do.

Link to comment
Share on other sites

A postprocessor is not a big deal if it does what you need it to do.

992572[/snapback]

 

True, but it makes the macros less useful for other people unless they want to use my tools.

 

BTW, I've figured out how to shrink the score kernel's scratchpad usage to the point of being a non-issue. I don't think I'll bother with the full optimization since other parts of the display kernel would still need more scratchpad, but assuming score is kept in two byte BCD, a four-digit-plus-trailing-zero score kernel could be done only trashing two bytes of RAM (actually, it could be done trashing none, but the setup/cleanup would be annoying and I don't really see much point except as a challenge.

 

CHALLENGE

 

Assuming a player's score is kept in a two-byte BCD variable. Write a mini-kernel that will display the score, with a trailing zero, using five "close" eight-pixel-wide white sprites on a black background without trashing any RAM. All hardware registers are fair game provided they do not disrupt the display, and it's okay to trash RAM if it's restored afterward.

 

Can anyone see how to do it?

 

Some hints:

 

-1- Unrolling the loop can help in a couple ways.

 

-2- Some places can hold a few bits for safe-keeping but not whole bytes.

 

-3- Will the score be turned into something else? Can it be reconstituted afterwards?

 

-4- The TIA and RIOT may both be helpful.

 

-5- Who needs a stack pointer?

 

-6- I think the kernel itself will to use eight bytes of scratchpad (four pointers) and the A and Y registers, so 64 bits have to be stashed someplace; stash 8 bits each in X and S and that leaves 48. Can you do it?

Link to comment
Share on other sites

1. Unrolling usually helps because you won't need to keep a counter to iterate over the number of loops. Things that are variable can be fixed and not require ram.

 

2. Eep, you are free to use any hardware registers that allow read/write to store information as long as those registers are not required during your routine. Some candidates in the RIOT might be the Data Direction Registers (2 x 8 bits)... Seem to recall a warning about setting port B to output however.

 

3. Convert the score to pointers use the pointers during display and then convert the pointers back to the score..

 

4. The TIA will be used for display and you state you are using 2 bytes of the 128 byte RIOT RAM. I keed. Actually I don't have an answer for 4.

 

5. Stack pointer can make a good fast temporary, but make sure you disable interrupts first. :)

 

6. 48 bits, well, 16 to the DDR as stated in 2. which leaves 32... Oh no, you're not planning on writing values into the timer and then reading them back and fixing up the value based on a fixed # of cycles that have passed... oh yes you are. :roll: . So that leaves 24, You are going to reuse the two byte BCD so that leaves 8 bits which are probably apportioned to other data, like probing the playfield with the ball and reading back stored data in the collision bits or some other hair-brained scheme like ask the user to hold or release the fire button (that's one bit right there!) :D :D ... but it's late.... :sleep:

Link to comment
Share on other sites

Other crazy places to stash bits. the P register (ie flags), for a lot of code, you could be safe with the interrupt bit, the overflow bit (avoid carries from bit 6 to 7 in ADC and SBC and don't use BIT) depending on your code you might be able to use the carry but that's a lot tougher to avoid.

Link to comment
Share on other sites

2. Eep, you are free to use any hardware registers that allow read/write to store information as long as those registers are not required during your routine. Some candidates in the RIOT might be the Data Direction Registers (2 x 8 bits)... Seem to recall a warning about setting port B to output however.

 

DDR's are too dangerous, IMHO, though I suppose selected bits of SWBCTRL might be okay since not all pins have stuff hooked up to them.

 

4.  The TIA will be used for display and you state you are using 2 bytes of the 128 byte RIOT RAM.  I keed. Actually I don't have an answer for 4.

 

You could stash 20 bits in PF0-PF2, and 7 bits each in the position registers for M0, M1, and the Ball. Reading those bits out would be annoyingly slow (set all color registers to black after showing the digits, then watch the collision registers while using Player 0 and Player 1 to "probe" the other objects and playfield)

 

6. 48 bits, well, 16 to the DDR as stated in 2. which leaves 32...  Oh no, you're not planning on writing values into the timer and then reading them back and fixing up the value based on a fixed # of cycles that have passed...  oh yes you are.  :roll:

 

Or store the value in T1024T and read it back within 1024 cycles.

 

So that leaves 24, You are going to reuse the two byte BCD so that leaves 8  bits which are probably apportioned to other data, like probing the playfield with the ball and reading back stored data in the collision bits or some other hair-brained scheme like ask the user to hold or release the fire button (that's one bit right there!)  :D  :D  ... but it's late....  :sleep:

992619[/snapback]

 

Twelve bits in the pointer high-bytes (three bits each). Seven bits in the Y register (put the digit shape data in the second half of a page, and then adjust the pointers according to whatever value happens to be stored in Y). Other possibilities include the top 3 bits of PC, ENABL, CTRLPF bits 0 (reflection control) and 4-5 (Ball size). Some collision registers might also be useful.

Link to comment
Share on other sites

Each kernel call thus requires four bytes of ROM (one for the BRK, one for the index, and two for the return address).  Unfortunately, the bytes are not consecutive.  Presently, I'm using a kludged-together post-processor so I can use non-sequential ORG directives, but that seems rather ugly.  Is there a better way to do things?
I don't see why you need non-sequential ORG's. It seems to be that two tables could just contain .byte <(label-1) and .byte >(label-1) where "label" is placed where you want the code to return to.

 

Am I missing something?

Yes. You're missing that you have to keep track of the return table yourself. If you change your source code, you have to update it manually. My method with macros maintains the index and table automatically. Need another call to the kernal? Just drop another macro invocation. No muss, no fuss.

 

Or store the value in T1024T and read it back within 1024 cycles.
I seem to recall that the RIOT timers decrement the value immediately upon load, then take however many cycles between counts until they reach FF, at which point they decrement on every clock.

 

If that's true, then T1024T would be a great place to store a loop counter.

Link to comment
Share on other sites

If that's true, then T1024T would be a great place to store a loop counter.

995005[/snapback]

 

I'll have to check to see exactly how it behaves, but the most important thing is that it's predictable. BTW, in some games where there isn't a "natural" scan line counter (e.g. because everything on-screen can move vertically) INTIM can be excellent for that purpose. Just load it so it will "time out" after the proper number of scan lines (when possible, I use the 128->127 transistion). Checking it will then be a cycle cheaper than decrementing a memory location (and won't waste a byte of RAM).

Link to comment
Share on other sites

Yes.  You're missing that you have to keep track of the return table yourself.  If you change your source code, you have to update it manually.  My method with macros maintains the index and table automatically.  Need another call to the kernal?  Just drop another macro invocation.  No muss, no fuss.

995005[/snapback]

 

I decided to give another go to automating the label generation, and it actually is possible to do things automatically without requiring a preprocessor and without having to manually generate labels. Ugly is an understatement, however. A program with nothing except 30 kernel calls generates 120 bytes of binary data (as it should) and more than 5,800 lines of listing. Perhaps I need to figure out how to use listing controls within macros. :)

 

Here's the code. Any suggestions for improvements?

 PROCESSOR 6502

       mac     qz_labeln ; qz_labeln {action},{label},{underscores}
       if {3} >= 5
         if {3} >= 10
           qz_labeln {1},{2}x,{3}-10
         else 
           qz_labeln {1},{2}v,{3}-5
         endif
       else
         if {3} > 0
           qz_labeln {1},{2}i,{3}-1
         else
           {1} {2}
         endif
       endif
       endm

       mac     qz_makelabel
{1}:
       endm

       mac     qz_makelsb_minus1
       byte    <(({1})-1)
       endm

       mac     qz_makemsb_minus1
       byte    >(({1})-1)
       endm

qz_kernelindex  .set    0

       mac     DO_KERNEL
       brk
       byte    qz_kernelindex
qz_kernelindex  .set qz_kernelindex+1
       qz_labeln qz_makelabel,QZ_,qz_kernelindex
       endm

       mac     QZ_MAKE_LSBS
qz_temp .set    0
       repeat  qz_kernelindex
qz_temp .set    qz_temp+1
       qz_labeln qz_makelsb_minus1,QZ_,qz_temp
       repend
       endm

       mac     QZ_MAKE_MSBS
qz_temp .set    0
       repeat  qz_kernelindex
qz_temp .set    qz_temp+1
       qz_labeln qz_makemsb_minus1,QZ_,qz_temp
       repend
       endm

       seg code
       org     $1000



       DO_KERNEL
       DO_KERNEL
       DO_KERNEL
       DO_KERNEL
       DO_KERNEL
       DO_KERNEL
       DO_KERNEL
       DO_KERNEL
       DO_KERNEL
       DO_KERNEL
       DO_KERNEL
       DO_KERNEL
       DO_KERNEL
       DO_KERNEL
       DO_KERNEL
       DO_KERNEL
       DO_KERNEL
       DO_KERNEL
       DO_KERNEL
       DO_KERNEL
       DO_KERNEL
       DO_KERNEL
       DO_KERNEL
       DO_KERNEL
       DO_KERNEL
       DO_KERNEL
       DO_KERNEL
       DO_KERNEL
       DO_KERNEL
       DO_KERNEL
QZ_LSBS:
               QZ_MAKE_LSBS
QZ_MSBS:
               QZ_MAKE_MSBS

               echo *
               .END

 

Note that the QZ_MAKE_LSBS and QZ_MAKE_MSBS macros must be invoked after the last invocation of DO_KERNEL, but provided that is done the assembler will generate the necessary code without a preprocessor. I'll have to see if I can clean up the listing file, though, since I use listings for reference a lot.

Link to comment
Share on other sites

Am I missing something?
Yes. You're missing that you have to keep track of the return table yourself. If you change your source code, you have to update it manually. My method with macros maintains the index and table automatically. Need another call to the kernal? Just drop another macro invocation. No muss, no fuss.

995005[/snapback]

I see, and yes, you're right. I was guessing that maintaining the table wouldn't be that difficult unless there are a large number of return addresses. Exactly how big is the table? Big macros and postprocessors seem like overkill unless there are >20 entries or so.

Link to comment
Share on other sites

It occurs to me that if the assembler macros support looping, and you don't mind having to use a "finish it off" macro, the return labels table could be dumped in a loop. So each kernel call macro would generate "RETURNn" where 'n' was the label number, then at the end it would do a table full of DW (or two tables full of DB) to RETURN1, RETURN2, RETURN3, etc.

 

You'd have to be realy tricky about it if you want to be able to use your kernel call macro _after_ the kernel return macro, though.

Link to comment
Share on other sites

  • 3 weeks later...
Are you going to reveal your take on the 'CHALLENGE' John?

1006757[/snapback]

 

Well, I think I've described in this thread the spots that I would use for saving bits here and there. Some more practical than others, to be sure. Trying to get the RAM usage for the score routine down to zero is purely academic; it's interesting that it can be done, but other parts of the kernel will have non-zero requirements so it's not worth eking out every single byte of RAM (my guess is that kernel usage elsewhere will probably end up being eight bytes, with the "return" held in the stack pointer). If that's the case, unrolling the loop would make the digit kernel easy; if I don't want to unroll the loop I'll have to save a byte which I can shove into the MSB's of the pointer addresses.

Link to comment
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

Loading...
  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...