Optimizing 6502 Assembly Programming

JamesD · April 28, 2015

I could adjust the value of commandtableH and commandtableL to reflect the negative value.

BTW, is there a better way to do this in as65? (not even sure if this code is legal)

commandtable1:
	.byt	>playsongend		; 0 end of song
	.byt	>playsongend		; just in case
commandtable2:
	.byt	<playsongend		; 0 end of song
	.byt	<playsongend		; just in case
.define commandtableH commandtable1+128 ; adjust for negative bit
.define commandtableL commandtable2+128 ; adjust for negative bit

Switched from #define to .define

*edit*

Ok, so .define doesn't work. It claims commandtableL and commandtableH are undefined.

This is what I had to do:

	tax					; move the command table offset to x
	lda	commandtableH+128,x		; load MSB of the command address, 128 added to offset negative bit
	pha					; push it on the stack
	lda	commandtableL+128,x		; load LSB of the command address, 128 added to offset negative bit
	pha					; push it on the stack	
	rts					; call address on the stack

...

commandtableH:
	.byt	>playsongend		; 0 end of song
	.byt	>playsongend		; just in case
commandtableL:
	.byt	<playsongend		; 0 end of song
	.byt	<playsongend		; just in case

It assembles anyway, I can't promise it works without testing. Apparently I have commands disabled with the Donkey Kong code so I tested it at some point and then disabled it. The output in the .lst file looks like what I want.

Yes I know as65 lets me define the table with .hibyte .lobyte but the assembler the code was originally written for doesn't and I may have to port it back to that assembler.

*edit*

That's 20 clock cycles... vs 14 best case for the stack based approach, 17 in the code from post #48 and 24 in post #47.

So for my music player and typical code it would be worse but under the proper conditions you might save some clock cycles.

You would need to know the stack won't overwrite your jump table, no interrupts can take place before the stack is restored, you have to know what the stack is supposed to be on exit and you have to use values that don't need the initial ASL all to save 6 clock cycles.

If you have to disable and re-enable interrupts the best case version is 18 clock cycles instead of 14.

FWIW, for my code I could store your jump table at the top of the stack if I used the proper entry values. Instead of using just the negative bit as a flag like I did, I'd have to use -2, -4, -6, -8, etc... which translate to $FE, $FC, $FA, $F8, etc...

To set it up I'd copy the existing stack down far enough to insert my table and adjust the stack pointer accordingly but you don't have to do that if you don't need to restore the stack on exit.

Edited April 29, 2015 by JamesD

JamesD · April 29, 2015

There is a way you can guarantee the best case for the stack based table jump.
If for example you have a game that can complete all it's updates in less than one interrupt cycle (vertical blank?) and then it waits for an interrupt before performing another update, the main game loop could save the stack pointer, set it to a fixed address and after an interrupt it could restore the stack pointer.
Then the interrupt handler can potentially use the best case code.

Since the main loop was waiting for the interrupt anyway, saving the stack pointer doesn't cost additional clock cycles.
If the stack pointer is saved to X with TSX, then restoring it only requires 2 clock cycles on return from the interrupt.

One problem, if you are waiting for an interrupt, why do you need to save that few clock cycles?
That would mean the delay would have to be short and you'd need to make enough table jumps inside the interrupt to save more clock cycles than the wait.

In theory

MaPa · April 29, 2015

And what is the purpose of your stack based jumps experiments?

JamesD · April 29, 2015

And what is the purpose of your stack based jumps experiments?

Creature XL posted it as a possible optimization and I just compared it to code I have already written for a music player I wrote.

The proposed optimization has a pretty narrow application since under situations like mine the code is actually slower.

I hadn't looked at your proposed code yet but as you suggest, I could also do this:

           asl
           sta    tcall+1
tcall:     jmp    (commandtable)

That looks like 11 clock cycles or almost half my current 6502 code. The one limitation is that it's not ROM safe like that so you have to add another JMP at 3 clock cycles in that case, -1 if the jmp (commandtable) is on page zero.

At 13 clock cycles (supposing ROM code and jmp on page zero) you are less than the stack based approach's best case without the ridiculous requirements and I believe that left out the ASL.

I was just followed an example on another web page with my original 6502 code.

If you have to update both addresses for the table I guess this approach might be slower.

As long as the table doesn't cross a page boundary that shouldn't be required though.

*edit*

BTW, the code I posted above is only 1 more clock cycle than the 65C02 version, though it's tables can cross a page boundary.

*IF* you could implement the stack based approach following the conditions I mention and without needing to restore the stack pointer over a group of calls you could save (I think) 1 clock cycle per jump and you would need over 7 jumps to save enough cycles to offset restoring the stack pointer. At least I think that's how it comes out. (non ROM code)

Edited April 29, 2015 by JamesD

MaPa · April 30, 2015

And you are calculating the command just before the call or is it more like lda command, asl, tax..... ?

JamesD · April 30, 2015

And you are calculating the command just before the call or is it more like lda command, asl, tax..... ?

My music player loads a BYTE from the song data into A that is supposed to indicate how many registers on the sound chip to modify and then uses a BMI to see if I have an embedded command in the song data instead.

The code posted is what is executed if that BMI is taken and cycle counts are based on the code posted alone.

The embedded commands are currently just a command number and the ASL converts it to a table offset where I include it.

Not all versions require the ASL like where I split the table into MSB and LSB.

FWIW, I haven't been able to think of a single application that would benefit from the stack based table "optimization" but someone else may figure out a use.for it

I *might* be able to modify my music player to use that approach by saving registers to page zero instead of the stack, but I have no music that would take advantage of it.

The player commands were intended for things like signalling the main application if the player gets to a certain point in a song, to implement repeats, counters, whatever... but I don't see that many commands being executed in one interrupt. On top of that I don't want to end up tearing my hair out trying to get it to work just to save a hand full of clock cycles.

But as I said, someone else may figure out a use for it.

*edit*

If this helps any:

           asl                       ; 2 clocks
           sta    tcall+1            ; 4 clocks
tcall:     jmp    (commandtable)     ; 5 clocks

I just added the clocks from memory and I'd have to look them up again to be certain.

The RTS is what kills my original call where the address is pushed on to the stack. It's 6 clock cycles.

Edited April 30, 2015 by JamesD

peteym5 · May 5, 2015

On the Vertical Blank interrupt not completing before a final RTI, one solution is a flag inside a custom NMI called "VBI_Complete" Do a BIT instruction, and if Neg is set, immediately jump to a RTI. Set the flag to 128 at the beginning of the VBI and LSR the flag before RTI. This could hold a game one frame if the VBI is incomplete.

Have to keep in mind that RTS table trick works best if you have 4 or more possible branches. Anything less, you are better doing CMP and BNE. I had situations where I was testing a flag for 20 possible values.

peteym5 · December 16, 2015

I had decided to revisit this. The RTS trick optimization has been implemented into "Secretum Labyrinth". I will let everyone in on something. All those screens use procedurally drawing. The room data has a flag for type of thing drawn, character, x pos, y pos, sizing, and other possible data. To draw a horizontal line, lets say the code is "$03" and I want to draw 20 characters with a byte of 72 at screen 10,10. It would be "DTA 03,72,10,10,20" I had looked at something further like combining the bits in the flag + screen offset from upper left, you will need to multiply 10x40 + 10 (=410; hi/lo=1,154), and rotate "Type Flag" left by 2 bits. So it will be now "DTA 12+1,154,72,20" that saves one byte per instruction. The RTS trick comes in handy for jumping to the horizontal line drawing programming if the code is 3. I am drawing whole screens that are around 1K in size using less than 80 bytes per room.

I looked at the ideal of using Deflate or other compression utilities to save each screen, but those ideal quickly failed because the procedural drawing proved to do it in much less memory.

Using optimization tricks is another way I am packing much more into these ROM cartridges.

snicklin · December 17, 2015

I had decided to revisit this. The RTS trick optimization has been implemented into "Secretum Labyrinth". I will let everyone in on something. All those screens use procedurally drawing.

Games like "Head over Heels" and "Spindizzy" which are huge also use this method. Similar methods are also used for graphics in a lot of graphic adventure games.

+Stephen · December 17, 2015

Pitfall on the 2600 was the 1st title I was aware that did this. Could have been others, but that one goes pretty far back.

Sign In

Optimizing 6502 Assembly Programming

Recommended Posts

JamesD

Link to comment

Share on other sites

JamesD

Link to comment

Share on other sites

MaPa

Link to comment

Share on other sites

JamesD

Link to comment

Share on other sites

MaPa

Link to comment

Share on other sites

JamesD

Link to comment

Share on other sites

peteym5

Link to comment

Share on other sites

peteym5

Link to comment

Share on other sites

snicklin

Link to comment

Share on other sites

+Stephen

Link to comment

Share on other sites

Join the conversation

Recently Browsing 0 members

Apps

My Activity Streams

More