Sleeping with Optimal Efficiently

glurk · August 20, 2023

What a title, LOL. Anyway, I've written this kernel that requires entry on cycle 5, with TIM64T set to a certain value, and running. And so I had to do a 63 cycle delay, set the timer, and jump into the kernel. And my 63-cycle delay is this:

 ldx #11
.sleep
 dex
 bpl .sleep
 nop

That's 6 bytes, best I could come up with. Can it be done in 5? I dunno... So it got me thinking about how to delay for longer times, up to a scanline, in the absolute fewest bytes.

So it's an open question... Anyone ever thought about this before? I know some here (😃) like really optimized code...

+grafixbmp · August 21, 2023

Something I like to use is the last part of this guide. You can achieve more than listed by using combinations while keeping the rom footprint low.

https://www.qotile.net/minidig/docs/2600_advanced_prog_guide.txt

Anything more would require a loop

glurk · August 21, 2023

I'm aware of that guide, I use those little 'NOP-tricks' all the time. I was speaking of longer delays. The code I posted above does a 63 cycle delay. Remove the NOP at the end and it would be a 61 cycle delay. But what if I needed a 62 cycle one? I'd need a 1-cycle NOP, haha... So how best to do a 62 cycle delay, then?

That's what I was getting at, the most efficient way to do longer delays that require loops.

Here's a demo of the crazy kernel I've been working on, I think it's kinda neat so far: WIP.bin

JetSetIlly · August 21, 2023

Edited August 21, 2023 by JetSetIlly

glurk · August 21, 2023

3 hours ago, JetSetIlly said:
If you include the initial LDX as part of the cycle count then this is 62 cycles:
 ldx #10
.sleep
 dex
 bpl .sleep
 nop 0
 nop 0

Yes, but it is 9 bytes. Is that the shortest possible way? I'm working like the "old days" on a 2K project, and that's kinda why I was asking. If that loop used a "dec $2d" 5-cycle NOP, or maybe some other variation, maybe it could be shorter.

I'm thinking along the lines of a "2600 programmer advanced reference" with all delays up to 76 cycles (maybe more) in their most optimal (fewest bytes) form. That way, it would be possible to just look it up rather than figure it out from scratch every time...

There is a lot of stuff out there for beginning 2600 programming, but I was thinking something for advanced programmers to reference would be of use. And I was curious if anyone else had ever thought about that.

tracks guy · August 21, 2023

If you don't mind a few pushes to the stack, this is 5 bytes and 63 cycles:

    lda     #0x0A       ; 0A is the ASL instruction
    pha
    bne     *-2

Omegamatrix · August 21, 2023

This reminded me of a macro I wrote a while back. Similar situation where I didn't want to waste a lot of bytes for long periods of sleep, but also wanted flexibility during development. This macro uses X regisgter but of course it would be trivial to change it to use Y if needed. Right now when done X is always zero which can be very handy.

  MAC DELAY_X_CYCLES
; - needs to be a minimum of 13 cycles to delay
LOOP_COUNT SET ({1}-1)/5
REMAINDER SET ({1}-1) - (LOOP_COUNT*5)

  IF REMAINDER = 1
LOOP_COUNT SET LOOP_COUNT-1
    SLEEP 6
  ENDIF
  IF REMAINDER > 1
    SLEEP REMAINDER
  ENDIF

    ldx    #LOOP_COUNT
.loopDelayCycles:
    dex
    bne    .loopDelayCycles
  ENDM

Omegamatrix · August 21, 2023

1 hour ago, tracks guy said:
If you don't mind a few pushes to the stack, this is 5 bytes and 63 cycles:
    lda     #0x0A       ; 0A is the ASL instruction
    pha
    bne     *-2

I've always loved jumping between operator and operand. Very similar to CLEAN_START macro that I took a byte out of with this trick.

    cld
.loopClear:
    ldx    #$0A                  ; ASL opcode = $0A
    inx
    txs
    pha
    bne    .loopClear+1          ; jump between operator and operand to do ASL

Omegamatrix · August 21, 2023

One of my favorite tricks for wasting cycles in a kernal is taking advantage of the X register if has a known value. You can easily add 1 cycle delay to the code when storing zeropage.

Random Example Code, Definitely NOT from Venture Reloaded...

;---------------------------------------
    sty    VBLANK                ;3  @3
    stx    GRP1                  ;3  @6
    sta    GRP0                  ;3  @9
    DRAW_ARROW_16                ;16 @25
    ldx    #LEFT_2               ;2  @27
    ldy    ts_loopCount-LEFT_2,X ;4  @31   *** X always = LEFT_2 here, delay an extra cycle without penalty of wasting bytes as LDY ZP,X takes 2 bytes...
    lda    (ts_bowTwo),Y         ;5  @36
    dey                          ;2  @38
    bne    .loopBowAndArrows     ;2³ @40/41

You could maybe use the code below for a 62 byte delay, and squeeze another cycle out somewhere else using the trick above. When the code below is done A=0, Carry is clear, and X and Y are preserved.

    lda #$2A    ; ROL opcode
.loopDelay62:
    asl
    bne .loopDelay62-1  ; jump between operator and operand

+grafixbmp · August 22, 2023

Crazy idea. Maybe you could use a pre existing loop system like horizontal positioning with an extra branch at the end for a special case for the delay. Just set an object you won't use and branch out on it. Just a thought

glurk · August 22, 2023

LOTS of good ideas here!! 😃 I love stuff like this, I'd never have thought of the jumping between using ASL and ROL. And it preserves the X and Y registers, awesome. I'm actually using that one.

These ideas are the sort of thing I'd like to have in a reference. I keep having to look up various things in various books and places online, etc, and always wish I had all of it in one handy location. Something I hope to work on, time permitting...

+Stephen · August 22, 2023

I love these threads, there is always a way to shave one byte or one cycle it seems.

Omegamatrix · August 23, 2023

16 hours ago, glurk said:

LOTS of good ideas here!! 😃 I love stuff like this, I'd never have thought of the jumping between using ASL and ROL. And it preserves the X and Y registers, awesome. I'm actually using that one.

These ideas are the sort of thing I'd like to have in a reference. I keep having to look up various things in various books and places online, etc, and always wish I had all of it in one handy location. Something I hope to work on, time permitting...

Right now I am working on the macro I posted earlier to do a little more while keeping it general purpose. I found that in some cases adding a byte or two inside the delay loop was better for overall byte savings at certain cycles. At a high level the current macro fine tunes its cycle wastage using SLEEP. The exact number of bytes the routine uses varies, but it is always between 5-8 total. There is pattern to the amount of bytes used, and the pattern repeats so it is easy to determine for a certain delay how many bytes is required. The pattern is 5, 8, 6, 7, 7 bytes used (five variations total).

Inserting a single NOP into the main loop creates a new routine has a pattern to 7 different byte variations, and insert two NOP's makes another routine with 9 different varations. To figure out which routine is more optimal at any given cycle required evaluation. I put the pattern of the number of bytes for each routine in excel and looked at them side by side up to lowest common mutliple (i.e 5 byte pattern x 7 byte pattern, LCM = 35).

The original routine only ever used 5-8 bytes max, but the alternates can sometimes save bytes when the orginal used 7 or 8 bytes. This is determined automatically in new macro. A funny thing I found through this excercise was inserting NOP $EA for a 3 cycle delay made no improvements at all over the original routine or alternatives (even the alternate routine that used an extra 2 NOP's in the loop for an extra 4 cycle delay). That's just how it worked out.

Anyhow lots of detail there, but I already got the macro working and now I'm going to look at handling crossing page boundaries so that it dynamically adjusts to work as expected, and then make a Y index register variant before releasing.

Omegamatrix · August 23, 2023

With alternative routines outside the general purpose, it would be good to have a book of all the possible routines one could use, and could be handy in cases where bytes need to be squeezed or one register or another can't be used. They are all very niche though so I wouldn't put it into the general purpose routine, because the general purpose routine should have a defined state as an output (i.e. X=0, A and Y untouched, etc...).

Still though they can be a lot of fun to find a bunch of niche routines, and if nothing else it gets the creative juices flowing. The 4 byte routines are probably as small as you can go without relying on jumping into other parts of code which doesn't make it useful for others to use. There should be many more 5 byte routines then 4 byte, and even 6 byte routines could still be useful over the general purpose routine at certain cycles.

Here are a few 4 cycle routines. It is easy to see doing variants of like INX INX or DEY DEY creates a five byte routine with different cycles, but yeah lots of combinations possible.

; 39 cycle delay
    lda #$0A    ; ASL
.loopDelay39:
    bne .loopDelay39-1
    
; another 39 cycle delay...
    lda #$4A    ; LSR
.loopDelay39:
    bne .loopDelay39-1

; 49 cycle delay
    ldy #$88    ; DEY
.loop49
    bmi .loop49-1

; 124 cycle delay
    ldx #$E8    ; INX
.loopDelay124:
    bne .loopDelay124-1

+Andrew Davie · August 23, 2023

I wrote, 25 years ago...
https://www.biglist.com/lists/stella/archives/199805/msg00090.html

+MarcoJ · August 23, 2023

I came here expecting a discussion on human sleep hacking; trying to get the most efficient time programming for the minimum amount of sleep. Instead, it's about keeping the 6507 occupied for a while. Good to read the different responses.

Omegamatrix · August 24, 2023

Updated routines for delay using X or Y (DELAY_X and DELAY_Y).

NewMacros.h

Summary (DELAY_X):

; - works for any delay from 8 to 1280 cycles
; - only uses 5-9 bytes
; - no illegal opcodes
; - immune to crossing page boundaries
; - does not affect the state of A,Y,SP, or V/B/D/I/C flags
; - always finishes with X=0, Zero flag set, and Negative flag clear

I completely abandoned the alternate routines. Although working, the macro was getting bloated and hard to follow, and when I looked at handling crossing page boundaries it became quite complex with recursive conditions that required a ton of logic. It was completely against the spirit of the simple routine so I tossed it and simplified back to the original routine to update it from there.

What makes me happy with the new routines is that they are immune to crossing page boundaries so you can use it without hassle. They affect very little registers, leave with a defined state every time, use very few bytes for a long range of delay, and do not use illegal opcodes. 1280 cycles is probably more than anyone would ever need (16 scanlines + 64 cycles), but having a general purpose macro for long delays without consuming a lot of bytes is handy.

Omegamatrix · August 24, 2023

Getting around illegal opcodes for delay without affecting registers and handling page crossing boundaries was an interesting problem. Putting the extra bytes for wasting cycles after the delay loop simplified the problem as the extra bytes varied from 1 to 3 bytes at the time. It made the logic to check for page boundaries easier. Adjustments for crossing a page boundary were pretty easy, but a consequence of moving the extra bytes after the delay routine was that a solution for an odd cycle delay that didn't affect anything like illegal NOP's wasn't apparent. Even cycles were not a problem as you can just use regular NOP's.

I solved the problem by using a trick I made years ago and forgot about. Since a BNE was used in the loop it was guaranteed that the Zero flag would be set at exit so BEQ could be used as an unconditional branch. The trick though is you have to branch to where the branch instruction ends (essentially nowhere), and since the instruction ends where it is branching to the page is the same so no page crossing occurs. Maybe that is not exactly what happens on the sub-cycle instruction, but the importance here is that there is no page crossing penalty of an extra cycle. It wastes 3 bytes consistently without affecting any flags or registers, and is a legal work around. You just need to do it once if you have an odd cycle delay and then you can use normal NOP's after that. Of course any branch condition that is unconditional would have the same effect. In the routines I made BPL could also have been used since the Negative flag was clear.

Omegamatrix · August 24, 2023

Finally I would like to give credit to @supercat for the different page detection code in the macro. It was very simple and elegant to just use an EOR in the macro to detect a different page.

JetSetIlly · August 24, 2023

3 hours ago, Omegamatrix said:

Updated routines for delay using X or Y (DELAY_X and DELAY_Y).

Why have you chosen to avoid using illegal opcodes?

Omegamatrix · August 25, 2023

17 hours ago, JetSetIlly said:

Why have you chosen to avoid using illegal opcodes?

Great question. The main benefit is that the routine is now vanilla 6502 code which is universal for all 6502 systems and chip variants. This is desirable because you don't even need to consider if it will work on one machine and have unexpected behaviour on the next. It also don't exclude the routine from people who might not want to use illegal opcodes for whatever reason. The routine is simple, plain, and general purpose.

I myself use illegal ops all the time, and didn't start out with a goal of making this work without illegal ops. After I simplified the code to solve the page boundary problem it eliminated gobs of recursive IF statements to check for a multitude of scenarios. The problem then became apparent that the last part of the code still relied on the SLEEP macro, and that macro has a switch to compile with or without illegal opcodes. Concievably someone could set that switch and not be aware that it would affect this routine... in particular the SLEEP uses bit $0 instead NOP 0 so flags affected by the bit instruction (N/V/Z) can change.

It is more desirable to have the routine work like it does now because it avoids the problem altogether. I'd rather do that because it is simplier than having to remember that it behaves differently under a certain condition. Currently it ends with Z flag set, N flag clear, and V flag unchanged from entering the routine. The benefit is this allows people to easily add an unconditional branch after the routine to go somewhere else (2 bytes instead of 3 bytes for a jump instruction), or maybe branch conditionally depending on the state of the Overflow flag set before entering the routine.

Omegamatrix · August 26, 2023

Here is version 2 of the delay macros (DELAY_X and DELAY_Y):

DelayMacros.h

I woke up yesterday and realized it would be real easy to implement the alternate versions of the routine. All I had to do was produce each version of the routine for the desired delay and compare how many bytes each used. As simple as that sounds I had my face too far into the book and was going down rabbit holes of trying to recompare each routine based on certain delays, and if they crossed page boundaries which caused recursive checks. Just by running each routine to the end and then comparing the results eliminated all of that. I did wrap the routines into a loop though to condense the code.

Summary of DELAY_X. DELAY_Y is similar of course:

; DELAY_X
; By Omegamatrix
; Version 2 (Aug 26, 2023)
;  - Version 2 searches through versions of the delay routine to find
;    which one uses the least bytes, and then chooses it.
;  - Version 2 still follows all the critera from Version 1, listed in the summary below.
;
; Summary:
; - works for any delay from 8 to 1280 cycles
; - only uses 5-9 bytes
; - no illegal opcodes
; - immune to crossing page boundaries
; - does not affect the state of A,Y,SP, or V/B/D/I/C flags
; - always finishes with X=0, Zero flag set, and Negative flag clear

Omegamatrix · August 30, 2023

Here is version 3 of the Delay Macros:

DelayMacros(Ver3).zip

This version does better byte optimization when page boundaries are involved. The algorithm continues to look for which routine saves the most bytes, but now a few more scenarios are considered. Overall the routine is essentially just a delay loop with some extra bytes outside of the loop to make up any remaining cycles. In version 3 a scenario of having the extra bytes before or after the loop is now considered, where as version 2 just had them after the loop. Allowing this new scenario potentially prevents the page boundary crossing from happening.

Sometimes it uses less bytes when the delay loop does not have a page crossing, and sometimes it doesn't. Other times no matter where the routine goes a particular delay will always use the same amount of bytes (although the routines for page boundary and non-page boundary have to be different to make the same delay, obviously!). Therefore, a given delay has to have every scenario checked and total bytes considered to find the best solution.

I've included an excel sheet to show the total bytes for all delays with and without page boundary crossings. The range of the algorithm is now 14 instead of 8, because that is where the routine takes its first branch to do a loop. It was very inefficient for byte usage when delays were less than 14. I'm not saying it is the most efficient routine for a 14 byte delay (it is general purpose after all), but it is hard to justify having a routine with a delay loop in it unless it does at least one loop.

One more thing... If a page boundary is not encountered then the delay always takes 5-8 bytes which IMHO is not too bad. Only when a page boundary is encountered are 9 bytes occasionally needed. However, version 3 of the delay routines negates using some these 9 byte delays by finding other versions that work. I'm pretty happy with it overall.

Omegamatrix · September 4, 2023

Version 4 of the delay macros:

DelayMacros(Ver4).zip

I considered a few more sceranios. The first was moving the left over bytes ahead of the delay loop to try and deliberately force a page crossing, and then evaluate overall bytes needed. I also considered 3, 5, and 6 extra cycle delays in the loop. At that point the macro was pretty much unmanageable so I combined the main portions of the code for DELAY_X and DELAY_Y, and refactored it all. It is a lot cleaner now, and is easier to follow and maintain. Overall routine selection has increased for the better as many more instances of the delay got a byte or two shaved off.

Code is still generic and will work on any 6502 system whether it's an 2600 or not.

Sleeping with Optimal Efficiently

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation

Recently Browsing 0 members