Faster Caching


bent_pin

I have a 115-byte kernel to cache, and the loop takes quite a bit of time:

; RAM_Kernel is in ROM
; Kernel is in Zeropage

Cache_the_Kernel:				; CLK	ROM
	ldx #End_Kernel-RAM_Kernel		; 2	2
Cache_Kernel_Loop:
	lda RAM_Kernel-1,X			; 4	3
	sta Kernel-1,X				; 4	2
	dex					; 2	1
	bne Cache_Kernel_Loop			; 3/2	2
	;   	      2 + 13 * 114 + 12		= 1496	10
	; 19 lines + 52 clocks
	
	;;;;;;
	; 21 free clocks
	;;;;;;
	
	sta WSYNC ; 20 lines

 

 

So I took a different tack:

Cache_the_Kernel:				; CLK		ROM
	ldx #End_Kernel-RAM_Kernel		; 2		2		
Cache_Kernel_Loop:
	lda RAM_Kernel-1,X			; 4		3
	sta Kernel-1,X				; 4		2
	lda RAM_Kernel-2,X			; 4		3
	sta Kernel-2,X				; 4		2
	lda RAM_Kernel-3,X			; 4		3
	sta Kernel-3,X				; 4		2
	lda RAM_Kernel-4,X			; 4		3
	sta Kernel-4,X				; 4		2
	lda RAM_Kernel-5,X			; 4		3
	sta Kernel-5,X				; 4		2
	txa					; 2		1
	sbc #5					; 2		2
	tax					; 2		1
	bne Cache_Kernel_Loop			; 3/2		2
	; 		   2 + 49 * 22 + 48 	= 1128		33 Bytes
	; 14 lines + 64 clocks
	
	;;;;;;
	; 9 free clocks
	;;;;;;
	
	sta WSYNC ; 15 lines

 

I'll take a speed increase like that any day. What are some faster or different ways of achieving the same goal?


You could do:

Cache_the_Kernel				; CLK		ROM
	ldx #(End_Kernel-RAM_Kernel)/5-1        ; 2		2		
Cache_Kernel_Loop
	lda RAM_Kernel,X			; 4		3
	sta Kernel,X				; 4		2
	lda RAM_Kernel+115/5*1,X	        ; 4		3
	sta Kernel+115/5*1,X		        ; 4		2
	lda RAM_Kernel+115/5*2,X	        ; 4		3
	sta Kernel+115/5*2,X		        ; 4		2
	lda RAM_Kernel+115/5*3,X       	        ; 4		3
	sta Kernel+115/5*3,X		        ; 4		2
	lda RAM_Kernel+115/5*4,X	        ; 4		3
	sta Kernel+115/5*4,X		        ; 4		2
        dex					; 2             1
        bpl Cache_Kernel_Loop	        	; 3		2

But using PHA is even more efficient here. 

 

BTW: Your code needs one SEC before the loop.

Edited by Thomas Jentzsch
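
A sketch of where the SEC would go, assuming the five-way unrolled loop from the first post and its labels. SBC subtracts one extra when carry is clear, so if carry happens to be clear on entry the first pass leaves X at 109 rather than 110, and X then counts 104, 99, ... 4 and wraps past zero instead of ending the loop.

```asm
Cache_the_Kernel:
	ldx #End_Kernel-RAM_Kernel	; 115, a multiple of 5
	sec				; SBC borrows one extra if carry is clear
Cache_Kernel_Loop:
	lda RAM_Kernel-1,X
	sta Kernel-1,X
	lda RAM_Kernel-2,X
	sta Kernel-2,X
	lda RAM_Kernel-3,X
	sta Kernel-3,X
	lda RAM_Kernel-4,X
	sta Kernel-4,X
	lda RAM_Kernel-5,X
	sta Kernel-5,X
	txa
	sbc #5				; X >= 5 here, so no borrow: carry stays set
	tax
	bne Cache_Kernel_Loop		; X reaches 0 after 23 passes
```

None of LDA, STA, TXA, TAX or BNE touches carry, so one SEC outside the loop is enough. It costs 2 clocks and 1 byte.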

; RAM_Kernel org = new page + Kernel
; Places them at the correct offset to use
; the stack pointer as the X offset in ROM

; This assumes an empty stack
Cache_the_Kernel:				; CLK	ROM
	ldx #Kernel+End_Kernel-RAM_Kernel	; 2 	2  
	txs					; 2     1
Cache_Kernel_Loop:
	lda RAM_Kernel-1,X			; 4	3
	pha					; 3	1
	lda RAM_Kernel-2,X			; 4	3
	pha					; 3	1
	lda RAM_Kernel-3,X			; 4	3
	pha					; 3	1
	lda RAM_Kernel-4,X			; 4	3
	pha					; 3	1
	lda RAM_Kernel-5,X			; 4	3
	pha					; 3	1
	tsx					; 2	1
	cpx #Kernel				; 2	2
	bne Cache_Kernel_Loop			; 3/2	2
	ldx #$ff				; 2	2
	txs					; 2	1
	; 4 + 42 * 22 + 41 + 4 = 973 Clocks 31 bytes of ROM
	; 12 lines + 61 clocks
	
	;;;;;;
	; 12 free clocks
	;;;;;;
	
	sta WSYNC ; 13 lines

Ok, I think that's the best that I can make it. I considered all the wonderful advice, and it made great improvements.

 

The principle is that the code in ROM shares the same relative address as the Kernel space in ZP. That way, when SP is transferred to X to be compared, it's also suitable as the X offset for the next iteration of the loop. This code is untested, but I think it should work.

 

Any mistakes or suggestions?

 

You all rock, thanks for your help.


22 minutes ago, Thomas Jentzsch said:

If the code is at the beginning of the ZP-RAM you can use BMI after TSX instead.

I put my variables in the beginning of RAM, then the Kernel space, then the stack. That way when I have smaller kernels I can use more of the stack. Is that wrong-think? Would BMI provide a benefit to speed or ROM size?

 

; RAM_Kernel org = new page + Kernel
; Places them at the correct offset to use
; the stack pointer as the X offset in ROM

Cache_the_Kernel:				; CLK	ROM
	tsx					; 2	1
	txa					; 2	1
	tay					; 2	1
	ldx #Kernel+End_Kernel-RAM_Kernel	; 2 	2  
	txs					; 2     1
Cache_Kernel_Loop:
	lda RAM_Kernel-1,X			; 4	3
	pha					; 3	1
	lda RAM_Kernel-2,X			; 4	3
	pha					; 3	1
	lda RAM_Kernel-3,X			; 4	3
	pha					; 3	1
	lda RAM_Kernel-4,X			; 4	3
	pha					; 3	1
	lda RAM_Kernel-5,X			; 4	3
	pha					; 3	1
	tsx					; 2	1
	cpx #Kernel				; 2	2
	bne Cache_Kernel_Loop			; 3/2	2
	tya					; 2	1
	tax					; 2	1
	txs					; 2	1
	; 10 + 42 * 22 + 41 + 6 = 981 Clocks 34 bytes of ROM
	; 12 lines + 69 clocks
	
	;;;;;;
	; 4 free clocks
	;;;;;;
	
	sta WSYNC ; 13 lines

Also, fixed the empty stack assumption.


41 minutes ago, Thomas Jentzsch said:

You save the CPX, that's 2 cycles and 2 bytes.

I was just realizing that. I will have to think about that and how I organize my ZP.
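
A minimal sketch of the BMI variant, assuming the kernel area starts at $80, the first byte of the 2600's RAM. It replaces the tsx / cpx / bne tail of the loop above. TSX sets the N flag from bit 7 of the stack pointer, so no compare is needed: every address from $80 to $FF reads as negative, and N clears as soon as SP drops to $7F or below.

```asm
	tsx				; N = bit 7 of SP
	bmi Cache_Kernel_Loop		; SP >= $80: more bytes to copy
	; falls through once SP has dropped below $80,
	; that is, right after the byte at $80 was pushed
```

If the byte count is not a multiple of the unroll factor, the last pass keeps pushing below $80, which is outside RAM.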

41 minutes ago, Thomas Jentzsch said:

BTW: Why do you need the copying to be that fast? Usually you copy once and then modify only.

Great question. I'm trying to create a few different self-modifying kernels, and I'd like to have flexibility in what I can do on the same screen. I want to explore more advanced SMC kernels such as you and other talented users have shared, but I also want to try SMC with traditional code.

The 115-byte kernel is a two-liner that provides:
	1 line resolution to: 
		sprite 0, 1
		
	2 line resolution to:
		background color
		playfield 0, 1, 2
		player color
		ball
		missile 0, 1

It may not seem like much, but it has unlimited height and allows the playfield to be moved while rendering the players. Any of these can be anywhere on the visible line without restriction. A lot of games that I see limit your horizontal movement because they don't render everything in time to give full-width access. It just seemed like a nice gimmick to have more interactive moving objects while sticking to generally traditional loads. Everything is finished loading in the HBlank, and the kernel uses the visible scanline to fetch ROM values and modify itself for the next iteration. You have to specify the width/repeat of the sprites, the width of the missiles, and the playfield color ahead of time. I just wanted to see how much stuff I could cram into a kernel that more or less used the Atari as designed.

 

So, I can cache this inside of the frame. Then, move on to a different kernel 15 or so lines later.

 

Edit: I was unable to achieve this without the SMC. I'm always happy to learn something by being disproven.


7 hours ago, MarcoJ said:

@bent_pin a great discipline you're taking up. You mentioned before you're teaching a class, are you planning to make a game too? 

I currently teach each student individually, online. 2600 assembly has been part of the course for years, but just the super basic games like pong and combat. It's a great way to introduce relatively simple assembly before moving on to microcontrollers. Ideally, I start with each student between 7 and 9, with a goal of understanding basic abstract math and structured programming within 2 years.

 

I have made many games, but they are all derivative works and not worth sharing. However, when I saw Mr Run and Jump and how much it cost, I decided to make my own stick figure adventure game. I had quite a bit finished, but upon learning that self-modifying code was possible at this level, with only 128 bytes, I pitched my game and started over with the intention of really pushing my limits.

 

Still making the same game. The main character is called Stick. Stick listens to the invasive thoughts and ends up in situations, Sticky Situations, and you have to keep saving him. It's a single-player game with a 16x20 pixel player sprite. He can run, jump, climb, slide, wall-kick, fall, and swim.


  • 2 weeks later...

@RevEng @Thomas Jentzsch

 

Here is a final working version that uses both of your suggestions:

	; Preserve stack pointer				; CLK	ROM
	tsx							; 2	1
	txa							; 2	1
	tay							; 2	1
	
	; Set sp to the end of zp kernel space
	ldx #Kernel - 1 + #End_Kernel - #RAM_Kernel		; 2	2
	txs							; 2	1
	
Cache_the_Kernel:
	lda $fdff,X						; 4	3
	pha							; 3	1
	lda $fdfe,X						; 4	3
	pha							; 3	1
	lda $fdfd,X						; 4	3
	pha							; 3	1
	lda $fdfc,X						; 4	3
	pha							; 3	1
	lda $fdfb,X						; 4	3
	pha							; 3	1
	tsx							; 2	1
	bmi Cache_the_Kernel					; 3/2	2
	; Runs as far down as zp$7c
		
	; Restore the stack pointer
	tya							; 2	1
	tax							; 2	1
	txs							; 2	1
	
	; 10 + n x 8 + 39 + 6
	; around 8.6 clocks per byte
	; around 8.8 bytes per scanline
	; 32 bytes of ROM space

 

Am I creating page boundary issues with my lda addressing the end of the previous page? I could just change the ROM position by -5 to be able to raise the address starting points.

Any other suggestions?

 

Thanks for all the tips so far.
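
On the page-boundary question: on the 6502, LDA absolute,X takes a fifth clock whenever the low byte of the base address plus X carries into the next page. With bases $FDFB to $FDFF and X at $80 or above, that happens on every load, so each lda in the listing above would cost 5 clocks rather than 4. Below is a sketch of one way around it. The addresses, the 115-byte size and the name ROM_BASE are assumptions for illustration, not values from the thread, and the stack pointer save and restore are left out.

```asm
; Assumptions (hypothetical): the kernel area starts at $80, is 115 bytes
; long, and the ROM image is placed so that the byte destined for
; zero-page address A sits at ROM_BASE+A.
Kernel		= $80
KERNEL_SIZE	= 115
ROM_BASE	= $FE04

	ldx #Kernel+KERNEL_SIZE-1	; $F2, last byte of the kernel area
	txs
Cache_the_Kernel:
	lda ROM_BASE,X			; $04 + $F2 = $F6, same page: 4 clocks
	pha				; stores at SP, then SP = SP - 1
	lda ROM_BASE-1,X		; X is unchanged, so the base steps down
	pha
	lda ROM_BASE-2,X
	pha
	lda ROM_BASE-3,X
	pha
	lda ROM_BASE-4,X		; base $FE00: low byte 0 cannot carry
	pha
	tsx				; X follows SP for the next pass
	bmi Cache_the_Kernel
```

With this placement the image would occupy $FE84 to $FEF6, and the five bases run from $FE04 down to $FE00, so no load crosses a page.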


35 minutes ago, glurk said:

Really negligible,  but tsx, stx TEMP / ldx TEMP, txs would be a tiny bit quicker if you have a TEMP zp available.

I'm not sure that I get it. Could you please elaborate a bit?

 

Edit: I see, for preserving the stack pointer, looks like 2 extra clocks. Every bit helps. 
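
A sketch of the suggestion, assuming TEMP is a spare zero-page byte that lies outside the area the copy overwrites (the name comes from the quoted post; its location is hypothetical).

```asm
	tsx			; 2 clocks, 1 byte
	stx TEMP		; 3 clocks, 2 bytes

	; ... set SP and run the copy loop here ...

	ldx TEMP		; 3 clocks, 2 bytes
	txs			; 2 clocks, 1 byte
	; 10 clocks and 6 bytes in total, against 12 clocks and 6 bytes
	; for tsx/txa/tay ... tya/tax/txs
```

The ROM cost is the same, the saving is 2 clocks, and Y is left free, at the price of one byte of RAM.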


4 hours ago, Thomas Jentzsch said:

If the stackpointer value is always the same, you can skip saving it.

Good point, I'll keep that in mind when using this. I moved the kernel to the beginning of zp. I need to consider the game as a whole to decide how I will handle the remaining variables. I may ditch the stack other than subroutines.

