Faster Caching

+bent_pin · July 28, 2023

I have a 115 byte kernel to cache and the loop takes quite a bit of time:

; RAM_Kernel is in ROM
; Kernel is in Zeropage

Cache_the_Kernel:				; CLK	ROM
	ldx #End_Kernel-RAM_Kernel		; 2	2
Cache_Kernel_Loop:
	lda RAM_Kernel-1,X			; 4	3
	sta Kernel-1,X				; 4	2
	dex					; 2	1
	bne Cache_Kernel_Loop			; 3/2	2
	;   	      2 + 13 * 114 + 12		= 1496	10
	; 19 lines + 52 clocks
	
	;;;;;;
	; 21 free clocks
	;;;;;;
	;;;;;;
	
	sta WSYNC ; 20 lines

So I took a different tack:

Cache_the_Kernel:				; CLK		ROM
	ldx #End_Kernel-RAM_Kernel		; 2		2		
Cache_Kernel_Loop:
	lda RAM_Kernel-1,X			; 4		3
	sta Kernel-1,X				; 4		2
	lda RAM_Kernel-2,X			; 4		3
	sta Kernel-2,X				; 4		2
	lda RAM_Kernel-3,X			; 4		3
	sta Kernel-3,X				; 4		2
	lda RAM_Kernel-4,X			; 4		3
	sta Kernel-4,X				; 4		2
	lda RAM_Kernel-5,X			; 4		3
	sta Kernel-5,X				; 4		2
	txa					; 2		1
	sbc #5					; 2		2
	tax					; 2		1
	bne Cache_Kernel_Loop			; 3/2		2
	; 		   2 + 49 * 22 + 48 	= 1128		33 Bytes
	; 14 lines + 64 clocks
	
	;;;;;;
	; 8 free clocks
	;;;;;;
	
	sta WSYNC ; 15 lines

I'll take a speed increase like that any day. What are some faster or different ways of achieving the same goal?
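
As a cross-check on the cycle totals for the two loops above, here is a small Python sketch (not part of the original post). It assumes a 115-byte kernel, 76 CPU cycles per scanline, and a final pass that is one cycle cheaper because the branch is not taken. The helper name `loop_cycles` is made up for illustration, and the SEC mentioned later in the thread is not counted.

```python
CYCLES_PER_LINE = 76  # CPU cycles in one scanline on the 2600

def loop_cycles(setup, body_taken, passes):
    """Total cycles for a counted loop whose last pass falls through.

    body_taken is the cost of one pass with the branch taken (3 cycles);
    the final pass is one cycle cheaper because the branch is not taken.
    """
    return setup + body_taken * (passes - 1) + (body_taken - 1)

KERNEL_BYTES = 115

# One byte per pass: lda abs,X (4) + sta zp,X (4) + dex (2) + bne (3) = 13
simple = loop_cycles(setup=2, body_taken=13, passes=KERNEL_BYTES)

# Five bytes per pass: 5 * (4 + 4) + txa/sbc/tax (6) + bne (3) = 49
unrolled = loop_cycles(setup=2, body_taken=49, passes=KERNEL_BYTES // 5)

for name, total in (("simple", simple), ("unrolled", unrolled)):
    lines, clocks = divmod(total, CYCLES_PER_LINE)
    print(f"{name}: {total} cycles = {lines} lines + {clocks} clocks")
```

By this count the one-byte loop comes to 1496 cycles (19 lines + 52 clocks) and the unrolled loop to 1128 cycles (14 lines + 64 clocks).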

RevEng · July 28, 2023

You could use PHA in place of the indexed writes, and you'll save 1 cycle per write.

You could also save 2 cycles on your X adjustment if you're willing to use undocumented opcodes. (LDA #$FF, SBX #5)

[edit: TXA; SBX #5 would be 1 byte less]
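
For readers who have not met SBX (also listed as AXS, opcode $CB), a rough Python model of what it does may help: it stores (A AND X) minus the immediate into X, ignores the incoming carry, and sets the flags the way CMP would. The function name `sbx` and the tuple it returns are illustrative only.

```python
def sbx(a, x, imm):
    """Model of the undocumented SBX #imm: X = (A & X) - imm.

    Returns (new_x, carry, zero). Unlike SBC, the incoming carry is ignored;
    carry out is set when no borrow occurred, as with CMP.
    """
    diff = (a & x) - imm
    new_x = diff & 0xFF
    return new_x, int(diff >= 0), int(new_x == 0)

# LDA #$FF / SBX #5: A is all ones, so A & X is just X.
print(sbx(0xFF, 115, 5))   # (110, 1, 0)

# TXA / SBX #5: A equals X, so A & X is again X.
print(sbx(115, 115, 5))    # (110, 1, 0)

# Last pass of a 115-byte copy: X goes from 5 to 0 and Z ends the loop.
print(sbx(5, 5, 5))        # (0, 1, 1)
```

Either pair costs 4 cycles against 6 for TXA / SBC #5 / TAX, and the result does not depend on the carry flag.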

Thomas Jentzsch · July 28, 2023

You could do:

Cache_the_Kernel				; CLK		ROM
	ldx #(End_Kernel-RAM_Kernel)/5-1	; 2		2
Cache_Kernel_Loop
	lda RAM_Kernel,X			; 4		3
	sta Kernel,X				; 4		2
	lda RAM_Kernel+115/5*1,X		; 4		3
	sta Kernel+115/5*1,X			; 4		2
	lda RAM_Kernel+115/5*2,X		; 4		3
	sta Kernel+115/5*2,X			; 4		2
	lda RAM_Kernel+115/5*3,X		; 4		3
	sta Kernel+115/5*3,X			; 4		2
	lda RAM_Kernel+115/5*4,X		; 4		3
	sta Kernel+115/5*4,X			; 4		2
	dex					; 2		1
	bpl Cache_Kernel_Loop			; 3		2

But using PHA is even more efficient here.

BTW: Your code needs one SEC before the loop.
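
The SEC remark can be illustrated with a small model of binary-mode SBC, which subtracts an extra 1 whenever carry is clear on entry. The sketch below counts how many passes a TXA / SBC #5 / TAX loop takes to reach zero from 115. The helper names are hypothetical, and the loads and stores are left out because they do not touch the carry flag.

```python
def sbc(a, operand, carry):
    """Binary-mode 6502 SBC: A - operand - (1 - carry). Returns (result, carry_out)."""
    diff = a - operand - (1 - carry)
    return diff & 0xFF, int(diff >= 0)

def passes_until_zero(start, step, carry, limit=1000):
    """Count loop passes until X reaches 0; None if it never does within limit."""
    x = start
    for count in range(1, limit + 1):
        x, carry = sbc(x, step, carry)
        if x == 0:
            return count
    return None

print(passes_until_zero(115, 5, carry=1))  # 23
print(passes_until_zero(115, 5, carry=0))  # None
```

With carry clear on entry, the first subtraction removes 6 instead of 5, so X steps past zero and the BNE never falls through at the intended point.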

Edited July 28, 2023 by Thomas Jentzsch

+bent_pin · July 28, 2023

Edit: Wait one, made a big mistake. Code to follow.

+bent_pin · July 28, 2023

; RAM_Kernel org = new page + Kernel
; Places them at the correct offset to use
; the stack pointer as the X offset in ROM

; This assumes an empty stack
Cache_the_Kernel:				; CLK	ROM
	ldx #Kernel+End_Kernel-RAM_Kernel	; 2 	2  
	txs					; 2     1
Cache_Kernel_Loop:
	lda RAM_Kernel-1,X			; 4	3
	pha					; 3	1
	lda RAM_Kernel-2,X			; 4	3
	pha					; 3	1
	lda RAM_Kernel-3,X			; 4	3
	pha					; 3	1
	lda RAM_Kernel-4,X			; 4	3
	pha					; 3	1
	lda RAM_Kernel-5,X			; 4	3
	pha					; 3	1
	tsx					; 2	1
	cpx #Kernel				; 2	2
	bne Cache_Kernel_Loop			; 3/2	2
	ldx #$ff				; 2	2
	txs					; 2	1
	; 4 + 42 * 22 + 41 + 4 = 973 Clocks 31 bytes of ROM
	; 12 lines + 61 clocks
	
	;;;;;;
	; 12 free clocks
	;;;;;;
	
	sta WSYNC ; 13 lines

Ok, I think that's the best that I can make it. I considered all the wonderful advice, and it made great improvements.

The principle is that the code in ROM shares the same relative address as the Kernel space in ZP. That way, when SP is transferred to X to be compared, it's also suitable as the X offset for the next iteration of the loop. This code is untested, but I think it should work.

Any mistakes or suggestions?

You all rock, thanks for your help.
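
The shared-offset idea can be modeled in a few lines of Python. This is a sketch under its own assumptions rather than a transcription of the listing: the kernel is placed at $88, the ROM image sits at an assumed page $FE00 plus the same low byte, and SP starts on the last kernel byte because PHA stores at SP and then decrements. All constants and names here are illustrative.

```python
ROM_PAGE = 0xFE00          # assumed page holding the ROM image
KERNEL = 0x88              # assumed zero-page address of the kernel
SIZE = 115                 # kernel length in bytes (a multiple of 5)

# 64 KiB memory model; the ROM image sits at ROM_PAGE + KERNEL.
mem = bytearray(0x10000)
image = bytes((i * 7 + 1) & 0xFF for i in range(SIZE))  # dummy kernel bytes
mem[ROM_PAGE + KERNEL:ROM_PAGE + KERNEL + SIZE] = image

sp = KERNEL + SIZE - 1     # PHA stores at SP, then decrements
x = sp                     # ldx #... / txs
while True:
    for k in range(5):     # five unrolled "lda ROM_PAGE-k,X / pha" pairs
        a = mem[ROM_PAGE - k + x]
        mem[sp] = a        # on the 2600 the stack page mirrors zero-page RAM
        sp = (sp - 1) & 0xFF
    x = sp                 # tsx: the new SP is also the next ROM offset
    if x == KERNEL - 1:    # cpx / bne exit test
        break

print(bytes(mem[KERNEL:KERNEL + SIZE]) == image, hex(sp))  # True 0x87
```

Because X is only refreshed by the TSX at the end of each pass, each of the five loads needs a base address one lower than the previous one.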

Thomas Jentzsch · July 28, 2023

If the code is at the beginning of the ZP-RAM you can use BMI after TSX instead.
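
The BMI exit works because TSX sets the N flag from bit 7 of the stack pointer, and every zero-page RAM address on the 2600 ($80-$FF) has that bit set. A minimal Python sketch, assuming a 115-byte kernel copied downward five bytes per pass with SP starting on the last kernel byte, shows why the kernel has to start at $80 for the test to stop in the right place. `passes_with_bmi` is a made-up helper.

```python
def passes_with_bmi(kernel_start, size, unroll=5):
    """Count passes of a push-based copy that loops while TSX leaves N set."""
    sp = kernel_start + size - 1      # SP starts on the last kernel byte
    passes = 0
    while True:
        sp -= unroll                  # five PHAs per pass
        passes += 1
        if not (sp & 0x80):           # tsx / bmi: fall through once bit 7 clears
            return passes, sp

print(passes_with_bmi(0x80, 115))     # (23, 127): stops right below the kernel
print(passes_with_bmi(0x88, 115))     # (25, 125): two extra passes below $88
```

With the kernel at $88 the loop keeps pushing until SP drops under $80, which would overwrite whatever lives below the kernel.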

+bent_pin · July 28, 2023

22 minutes ago, Thomas Jentzsch said:

If the code is at the beginning of the ZP-RAM you can use BMI after TSX instead.

I put my variables in the beginning of RAM, then the Kernel space, then the stack. That way when I have smaller kernels I can use more of the stack. Is that wrong-think? Would BMI provide a benefit to speed or ROM size?

; RAM_Kernel org = new page + Kernel
; Places them at the correct offset to use
; the stack pointer as the X offset in ROM

Cache_the_Kernel:				; CLK	ROM
	tsx					; 2	1
	txa					; 2	1
	tay					; 2	1
	ldx #Kernel+End_Kernel-RAM_Kernel	; 2 	2  
	txs					; 2     1
Cache_Kernel_Loop:
	lda RAM_Kernel-1,X			; 4	3
	pha					; 3	1
	lda RAM_Kernel-2,X			; 4	3
	pha					; 3	1
	lda RAM_Kernel-3,X			; 4	3
	pha					; 3	1
	lda RAM_Kernel-4,X			; 4	3
	pha					; 3	1
	lda RAM_Kernel-5,X			; 4	3
	pha					; 3	1
	tsx					; 2	1
	cpx #Kernel				; 2	2
	bne Cache_Kernel_Loop			; 3/2	2
	tya					; 2	1
	tax					; 2	1
	txs					; 2	1
	; 10 + 42 * 22 + 41 + 6 = 981 Clocks 34 bytes of ROM
	; 12 lines + 69 clocks
	
	;;;;;;
	; 4 free clocks
	;;;;;;
	
	sta WSYNC ; 13 lines

Also, fixed the empty stack assumption.

Thomas Jentzsch · July 28, 2023

You save the CPX, that's 2 cycles and 2 bytes.

BTW: Why do you need the copying to be that fast? Usually you copy once and then modify only.

+bent_pin · July 28, 2023

41 minutes ago, Thomas Jentzsch said:

You save the CPX, that's 2 cycles and 2 bytes.

I was just realizing that. I will have to think about that and how I organize my ZP.

41 minutes ago, Thomas Jentzsch said:

BTW: Why do you need the copying to be that fast? Usually you copy once and then modify only.

Great question, I'm trying to create a few different self-modifying kernels. I'd like to have flexibility in what I can do on the same screen. I want to explore more advanced SMC kernels such as you and other talented users have shared, but I also want to try SMC with traditional code.

The 115-byte kernel is a two-liner that provides:
	1 line resolution to: 
		sprite 0, 1
		
	2 line resolution to:
		background color
		playfield 0, 1, 2
		player color
		ball
		missile 0, 1

May not seem like much but it has unlimited height and allows the playfield to be moved while rendering the players. Any of these can be anywhere on the visible line without restriction. A lot of games that I see limit your horizontal movement because they don't render everything in time to give full-width access. Just seemed like a nice gimmick to have more interactive moving objects while sticking to generally traditional loads. Everything is finished loading in the HBlank and uses the visible scanline to fetch ROM values and modify itself for the next iteration. You have to specify the width/repeat of the sprites, the width of the missiles, and the playfield color ahead of time. I just wanted to see how much stuff I could cram into a kernel that more or less used the Atari as designed.

So, I can cache this inside of the frame. Then, move onto a different kernel 15 or so lines later.

Edit: I was unable to achieve this without the SMC. I'm always happy to learn something by being disproven.

Thomas Jentzsch · July 29, 2023

That's pretty ambitious. I hope it works. Looking forward to the result.

+MarcoJ · July 29, 2023

@bent_pin a great discipline you're taking up. You mentioned before you're teaching a class, are you planning to make a game too?

+bent_pin · July 29, 2023

7 hours ago, MarcoJ said:

@bent_pin a great discipline you're taking up. You mentioned before you're teaching a class, are you planning to make a game too?

I currently teach each student individually, online. 2600 assembly has been part of the course for years, but just the super basic games like Pong and Combat. It's a great way to introduce relatively simple assembly before moving on to microcontrollers. Ideally, I start with each student between 7 and 9, with a goal of understanding basic abstract math and structured programming within 2 years.

I have made many games, but they are all derivative works and not worth sharing. However, when I saw Mr Run and Jump and how much it cost, I decided to make my own stick figure adventure game. I had quite a bit finished, but upon learning that self-modifying code was possible at this level, with only 128 bytes, I pitched my game and started over with the intention of really pushing my limits.

Still making the same game. The main character is called Stick. Stick listens to the invasive thoughts and ends up in situations, Sticky Situations, and you have to keep saving him. It's a single-player game with a 16x20 pixel player sprite. He can run, jump, climb, slide, wall-kick, fall, and swim.

+bent_pin · August 9, 2023

@RevEng @Thomas Jentzsch

Here is a final working version that uses both of your suggestions:

	; Preserve stack pointer				; CLK	ROM
	tsx							; 2	1
	txa							; 2	1
	tay							; 2	1
	
	; Set sp to the end of zp kernel space
	ldx #Kernel - 1 + #End_Kernel - #RAM_Kernel		; 2	3
	txs							; 2	1
	
Cache_the_Kernel:
	lda $fdff,X						; 4	3
	pha							; 3	1
	lda $fdfe,X						; 4	3
	pha							; 3	1
	lda $fdfd,X						; 4	3
	pha							; 3	1
	lda $fdfc,X						; 4	3
	pha							; 3	1
	lda $fdfb,X						; 4	3
	pha							; 3	1
	tsx							; 2	1
	bmi Cache_the_Kernel					; 3/2	2
	; Runs as far down as zp$7c
		
	; Restore the stack pointer
	tya							; 2	1
	tax							; 2	1
	txs							; 2	1
	
	; 10 + n x 8 + 39 + 6
	; around 8.6 clocks per byte
	; around 8.8 bytes per scanline
	; 33 bytes of ROM space

Am I creating page boundary issues with my lda addressing the end of the previous page? I could just change the ROM position by -5 to be able to raise the address starting points.

Any other suggestions?

Thanks for all the tips so far.
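
On the page-boundary question: on a 6502, LDA absolute,X takes 4 cycles, plus 1 when adding X to the low byte of the base address carries into the high byte. The Python sketch below applies that rule to the five base addresses in the listing, assuming X starts at $F2 (a 115-byte kernel at $80). The second set of bases is hypothetical and would require relocating the ROM image to match.

```python
def lda_abs_x_cycles(base, x):
    """Cycles for LDA base,X: 4, plus 1 if the index carries into the high byte."""
    crosses = (base & 0xFF) + x > 0xFF
    return 4 + int(crosses)

X = 0xF2  # assumed starting index: kernel at $80, 115 bytes long

for base in (0xFDFF, 0xFDFE, 0xFDFD, 0xFDFC, 0xFDFB):
    print(f"lda ${base:04x},X -> {lda_abs_x_cycles(base, X)} cycles")

# Hypothetical alternative: bases that start on the page boundary or above.
for base in (0xFE04, 0xFE03, 0xFE02, 0xFE01, 0xFE00):
    print(f"lda ${base:04x},X -> {lda_abs_x_cycles(base, X)} cycles")
```

Under these assumptions every load in the listing pays the extra cycle, so a pass would cost 45 cycles instead of 40, while the higher bases stay at 4 cycles per load for every X the loop uses.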

glurk · August 9, 2023

Really negligible, but tsx, stx TEMP / ldx TEMP, txs would be a tiny bit quicker if you have a TEMP zp available.

+bent_pin · August 9, 2023

35 minutes ago, glurk said:

Really negligible, but tsx, stx TEMP / ldx TEMP, txs would be a tiny bit quicker if you have a TEMP zp available.

~~I'm not sure that I get it. Could you please elaborate a bit?~~

Edit: I see, for preserving the stack pointer, looks like it saves 2 clocks. Every bit helps.
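
The saving can be tallied from the standard 6502 cycle counts. The short Python sketch below compares saving and restoring SP through Y against going through a zero-page byte; `TEMP` stands for any spare zero-page location, and the table only lists the instructions involved.

```python
CYCLES = {
    "tsx": 2, "txs": 2, "txa": 2, "tay": 2, "tya": 2, "tax": 2,
    "stx zp": 3, "ldx zp": 3,
}

via_y = ["tsx", "txa", "tay", "tya", "tax", "txs"]   # save in Y, restore from Y
via_temp = ["tsx", "stx zp", "ldx zp", "txs"]        # save in TEMP, restore from TEMP

def total(sequence):
    """Sum the cycle counts of a list of instructions."""
    return sum(CYCLES[op] for op in sequence)

print(total(via_y), total(via_temp))  # 12 10
```

Both routes take 6 bytes of ROM; the TEMP route trades one byte of RAM for the 2 cycles and leaves Y free.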

Thomas Jentzsch · August 10, 2023

If the stackpointer value is always the same, you can skip saving it.

+bent_pin · August 10, 2023

4 hours ago, Thomas Jentzsch said:

If the stackpointer value is always the same, you can skip saving it.

Good point, I'll keep that in mind when using this. I moved the kernel to the beginning of zp. I need to consider the game as a whole to decide how I will handle the remaining variables. I may ditch the stack for everything other than subroutines.
