+bent_pin Posted July 28, 2023

I have a 115 byte kernel to cache and the loop takes quite a bit of time:

    ; RAM_Kernel is in ROM
    ; Kernel is in Zeropage
    Cache_the_Kernel:                       ; CLK  ROM
        ldx #End_Kernel-RAM_Kernel          ; 2    2
    Cache_Kernel_Loop:
        lda RAM_Kernel-1,X                  ; 4    3
        sta Kernel-1,X                      ; 4    2
        dex                                 ; 2    1
        bne Cache_Kernel_Loop               ; 3/2  2
    ; 2 + 13 * 114 + 12 = 1496                     10 Bytes
    ; 19 lines + 52 clocks
    ;;;;;;
    ; 21 free clocks
    ;;;;;;
        sta WSYNC
    ; 20 lines

So I took a different tack:

    Cache_the_Kernel:                       ; CLK  ROM
        ldx #End_Kernel-RAM_Kernel          ; 2    2
    Cache_Kernel_Loop:
        lda RAM_Kernel-1,X                  ; 4    3
        sta Kernel-1,X                      ; 4    2
        lda RAM_Kernel-2,X                  ; 4    3
        sta Kernel-2,X                      ; 4    2
        lda RAM_Kernel-3,X                  ; 4    3
        sta Kernel-3,X                      ; 4    2
        lda RAM_Kernel-4,X                  ; 4    3
        sta Kernel-4,X                      ; 4    2
        lda RAM_Kernel-5,X                  ; 4    3
        sta Kernel-5,X                      ; 4    2
        txa                                 ; 2    1
        sbc #5                              ; 2    2
        tax                                 ; 2    1
        bne Cache_Kernel_Loop               ; 3/2  2
    ; 2 + 49 * 22 + 48 = 1128                      33 Bytes
    ; 14 lines + 64 clocks
    ;;;;;;
    ; 9 free clocks
    ;;;;;;
        sta WSYNC
    ; 15 lines

I'll take a speed increase like that any day. What are some faster or different ways of achieving the same goal?

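The totals in both listings follow the same pattern: setup cost, plus every pass but the last at the taken-branch price, plus a final pass that is one cycle cheaper. A minimal Python sketch of that arithmetic, assuming the per-instruction timings from the listing comments (no page crossings) and 76 CPU cycles per scanline. The helper name `loop_cycles` is made up for illustration.

```python
CYCLES_PER_SCANLINE = 76  # CPU cycles in one Atari 2600 scanline


def loop_cycles(setup, body, passes):
    """Total cycles for a loop that ends in a conditional branch.

    body is the cost of one pass without the branch; the branch costs
    3 cycles when taken and 2 on the final, not-taken pass.
    """
    return setup + (body + 3) * (passes - 1) + (body + 2)


# One byte per pass: lda abs,X (4) + sta zp,X (4) + dex (2) = 10 cycles
simple = loop_cycles(setup=2, body=10, passes=115)

# Five bytes per pass: 5 * (4 + 4) + txa/sbc/tax (6) = 46 cycles
unrolled = loop_cycles(setup=2, body=46, passes=115 // 5)

for name, total in (("simple", simple), ("unrolled", unrolled)):
    lines, clocks = divmod(total, CYCLES_PER_SCANLINE)
    print(f"{name}: {total} cycles = {lines} lines + {clocks} clocks")
```

Most of the gain from unrolling comes from paying the index update and branch once per five bytes instead of once per byte.
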
RevEng Posted July 28, 2023

You could use PHA in place of the indexed writes, and you'll save 1 cycle per write. You could also save 2 cycles on your X adjustment if you're willing to use undocumented opcodes (LDA #$FF, SBX #5).

[edit: TXA; SBX #5 would be 1 byte less]

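A rough Python model of the `TXA; SBX #5` idea, assuming the commonly documented behaviour of the undocumented SBX opcode: X becomes (A AND X) minus the immediate, and the incoming carry plays no part. The function names are illustrative only.

```python
def sbx(a, x, imm):
    """Model of undocumented SBX #imm: X = (A & X) - imm, carry set as by CMP."""
    diff = (a & x) - imm
    return diff & 0xFF, diff >= 0  # (new X, carry flag)


def count_passes(size, step=5):
    """Count passes of a loop whose index update is TXA; SBX #step; BNE."""
    x, passes = size, 0
    while True:
        passes += 1
        a = x                      # txa
        x, _carry = sbx(a, x, step)
        if x == 0:                 # bne falls through once X reaches zero
            return passes


print(count_passes(115))
```

TXA and SBX take 2 cycles each, so the update costs 4 cycles instead of the 6 for txa / sbc #5 / tax, and it does not depend on the carry being set beforehand.
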
Thomas Jentzsch Posted July 28, 2023 (edited)

You could do:

    Cache_the_Kernel                        ; CLK  ROM
        ldx #(End_Kernel-RAM_Kernel)/5-1    ; 2    2
    Cache_Kernel_Loop
        lda RAM_Kernel,X                    ; 4    3
        sta Kernel,X                        ; 4    2
        lda RAM_Kernel+115/5*1,X            ; 4    3
        sta Kernel+115/5*1,X                ; 4    2
        lda RAM_Kernel+115/5*2,X            ; 4    3
        sta Kernel+115/5*2,X                ; 4    2
        lda RAM_Kernel+115/5*3,X            ; 4    3
        sta Kernel+115/5*3,X                ; 4    2
        lda RAM_Kernel+115/5*4,X            ; 4    3
        sta Kernel+115/5*4,X                ; 4    2
        dex                                 ; 2    1
        bpl Cache_Kernel_Loop               ; 3    2

But using PHA is even more efficient here.

BTW: Your code needs one SEC before the loop.

Edited July 28, 2023 by Thomas Jentzsch

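The strided layout above splits the kernel into five 23-byte slices and walks all of them with one index. A small Python sketch, assuming a 115-byte kernel, that checks every byte is copied exactly once; `strided_copy` is a made-up name and the lists stand in for ROM and zero-page RAM.

```python
KERNEL_SIZE = 115
UNROLL = 5
STRIDE = KERNEL_SIZE // UNROLL  # 23, the "115/5" in the listing


def strided_copy(rom):
    """Copy rom into a new list the way the unrolled dex/bpl loop does."""
    ram = [None] * len(rom)
    writes = 0
    x = STRIDE - 1                      # ldx #(End_Kernel-RAM_Kernel)/5-1
    while x >= 0:                       # bpl loops until X wraps below zero
        for k in range(UNROLL):
            ram[STRIDE * k + x] = rom[STRIDE * k + x]
            writes += 1
        x -= 1                          # dex
    return ram, writes


rom = list(range(KERNEL_SIZE))
ram, writes = strided_copy(rom)
print(ram == rom, writes)
```

Because the index simply counts down to zero, a plain dex / bpl replaces the txa / sbc / tax sequence and no carry setup is needed.
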
+bent_pin Posted July 28, 2023

Edit: Wait one, made a big mistake. Code to follow.

+bent_pin Posted July 28, 2023

    ; RAM_Kernel org = new page + Kernel
    ; Places them at the correct offset to use
    ; the stack pointer as the X offset in ROM
    ; This assumes an empty stack
    Cache_the_Kernel:                       ; CLK  ROM
        ldx #Kernel+End_Kernel-RAM_Kernel   ; 2    2
        txs                                 ; 2    1
    Cache_Kernel_Loop:
        lda RAM_Kernel-1,X                  ; 4    3
        pha                                 ; 3    1
        lda RAM_Kernel-2,X                  ; 4    3
        pha                                 ; 3    1
        lda RAM_Kernel-3,X                  ; 4    3
        pha                                 ; 3    1
        lda RAM_Kernel-4,X                  ; 4    3
        pha                                 ; 3    1
        lda RAM_Kernel-5,X                  ; 4    3
        pha                                 ; 3    1
        tsx                                 ; 2    1
        cpx #Kernel                         ; 2    2
        bne Cache_Kernel_Loop               ; 3/2  2
        ldx #$ff                            ; 2    2
        txs                                 ; 2    1
    ; 4 + 42 * 22 + 41 + 4 = 973 Clocks            29 bytes of ROM
    ; 12 lines + 61 clocks
    ;;;;;;
    ; 12 free clocks
    ;;;;;;
        sta WSYNC
    ; 13 lines

Ok, I think that's the best that I can make it. I considered all the wonderful advice, and it made great improvements. The principle is that the code in ROM shares the same relative address as the Kernel space in ZP. That way, when SP is transferred to X to be compared, it's also suitable as the X offset for the next iteration of the loop.

This code is untested, but I think it should work. Any mistakes or suggestions? You all rock, thanks for your help.

Thomas Jentzsch Posted July 28, 2023

If the code is at the beginning of the ZP-RAM you can use BMI after TSX instead.

+bent_pin Posted July 28, 2023

22 minutes ago, Thomas Jentzsch said:
> If the code is at the beginning of the ZP-RAM you can use BMI after TSX instead.

I put my variables in the beginning of RAM, then the Kernel space, then the stack. That way when I have smaller kernels I can use more of the stack. Is that wrong-think? Would BMI provide a benefit to speed or ROM size?

    ; RAM_Kernel org = new page + Kernel
    ; Places them at the correct offset to use
    ; the stack pointer as the X offset in ROM
    Cache_the_Kernel:                       ; CLK  ROM
        tsx                                 ; 2    1
        txa                                 ; 2    1
        tay                                 ; 2    1
        ldx #Kernel+End_Kernel-RAM_Kernel   ; 2    2
        txs                                 ; 2    1
    Cache_Kernel_Loop:
        lda RAM_Kernel-1,X                  ; 4    3
        pha                                 ; 3    1
        lda RAM_Kernel-2,X                  ; 4    3
        pha                                 ; 3    1
        lda RAM_Kernel-3,X                  ; 4    3
        pha                                 ; 3    1
        lda RAM_Kernel-4,X                  ; 4    3
        pha                                 ; 3    1
        lda RAM_Kernel-5,X                  ; 4    3
        pha                                 ; 3    1
        tsx                                 ; 2    1
        cpx #Kernel                         ; 2    2
        bne Cache_Kernel_Loop               ; 3/2  2
        tya                                 ; 2    1
        tax                                 ; 2    1
        txs                                 ; 2    1
    ; 10 + 42 * 22 + 41 + 6 = 981 Clocks           34 bytes of ROM
    ; 12 lines + 69 clocks
    ;;;;;;
    ; 4 free clocks
    ;;;;;;
        sta WSYNC
    ; 13 lines

Also, fixed the empty stack assumption.

Thomas Jentzsch Posted July 28, 2023

You save the CPX, that's 2 cycles and 2 bytes.

BTW: Why do you need the copying to be that fast? Usually you copy once and then modify only.

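A Python sketch of why BMI can stand in for CPX here, under the assumption that the kernel sits at the very start of 2600 RAM ($80) and its size is a multiple of the unroll count. TSX sets the negative flag from bit 7 of the stack pointer, so the branch stops being taken as soon as the pointer drops below $80. The function `pha_copy` and the dictionary used as RAM are stand-ins for illustration.

```python
RAM_START = 0x80  # first byte of 2600 RAM in zero page


def pha_copy(rom, unroll=5):
    """Copy rom into RAM with pushes, ending the loop on the sign of SP."""
    assert len(rom) % unroll == 0
    ram = {}
    sp = RAM_START + len(rom) - 1            # txs: SP at the last kernel byte
    x = sp
    while True:
        for k in range(unroll):
            value = rom[x - k - RAM_START]   # lda base-k,X (ROM mirrors the RAM offset)
            ram[sp] = value                  # pha stores at SP ...
            sp = (sp - 1) & 0xFF             # ... then decrements it
        x = sp                               # tsx copies SP and sets N from bit 7
        if not x & 0x80:                     # bmi falls through once SP < $80
            return ram, sp


rom = list(range(115))
ram, final_sp = pha_copy(rom)
print(hex(final_sp), len(ram))
```

With any other kernel start address the sign bit no longer marks the end of the copy, which is why the CPX version is needed when variables come first in RAM.
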
+bent_pin Posted July 28, 2023

41 minutes ago, Thomas Jentzsch said:
> You save the CPX, that's 2 cycles and 2 bytes.

I was just realizing that. I will have to think about that and how I organize my ZP.

41 minutes ago, Thomas Jentzsch said:
> BTW: Why do you need the copying to be that fast? Usually you copy once and then modify only.

Great question. I'm trying to create a few different self-modifying kernels, and I'd like to have flexibility in what I can do on the same screen. I want to explore more advanced SMC kernels such as you and other talented users have shared, but I also want to try SMC with traditional code. The 115 byte kernel is a two-liner that provides:

1 line resolution to:
- sprite 0, 1

2 line resolution to:
- background color
- playfield 0, 1, 2
- player color
- ball
- missile 0, 1

May not seem like much, but it has unlimited height and allows the playfield to be moved while rendering the players. Any of these can be anywhere on the visible line without restriction. A lot of games that I see limit your horizontal movement because they don't render everything in time to give full-width access. Just seemed like a nice gimmick to have more interactive moving objects while sticking to generally traditional loads. Everything is finished loading in the HBlank, and the kernel uses the visible scanline to fetch ROM values and modify itself for the next iteration. You have to specify the width/repeat of the sprites, the width of the missiles, and the playfield color ahead of time. I just wanted to see how much stuff I could cram into a kernel that more or less used the Atari as designed.

So, I can cache this inside of the frame. Then, move on to a different kernel 15 or so lines later.

Edit: I was unable to achieve this without the SMC. I'm always happy to learn something by being disproven.

Thomas Jentzsch Posted July 29, 2023

That's pretty ambitious. I hope it works. Looking forward to the result.

+MarcoJ Posted July 29, 2023

@bent_pin a great discipline you're taking up. You mentioned before you're teaching a class, are you planning to make a game too?

+bent_pin Posted July 29, 2023

7 hours ago, MarcoJ said:
> @bent_pin a great discipline you're taking up. You mentioned before you're teaching a class, are you planning to make a game too?

I currently teach each student individually, online. 2600 assembly has been part of the course for years, but just the super basic games like Pong and Combat. It's a great way to introduce relatively simple assembly before moving on to microcontrollers. Ideally, I start with each student between 7 and 9, with a goal of understanding basic abstract math and structured programming within 2 years.

I have made many games, but they are all derivative works and not worth sharing. However, when I saw Mr Run and Jump and how much it cost, I decided to make my own stick figure adventure game. I had quite a bit finished, but upon learning that self-modifying code was possible at this level, with only 128 bytes, I pitched my game and started over with the intention of really pushing my limits.

Still making the same game. The main character is called Stick. Stick listens to the invasive thoughts and ends up in situations, Sticky Situations, and you have to keep saving him. It's a single player game with a 16x20 pixel player sprite. He can run, jump, climb, slide, wall-kick, fall, and swim.

+bent_pin Posted August 9, 2023

@RevEng @Thomas Jentzsch Here is a final working version that uses both of your suggestions:

    ; Preserve stack pointer                            ; CLK  ROM
        tsx                                             ; 2    1
        txa                                             ; 2    1
        tay                                             ; 2    1
    ; Set sp to the end of zp kernel space
        ldx #Kernel - 1 + #End_Kernel - #RAM_Kernel     ; 2    2
        txs                                             ; 2    1
    Cache_the_Kernel:
        lda $fdff,X                                     ; 4    3
        pha                                             ; 3    1
        lda $fdfe,X                                     ; 4    3
        pha                                             ; 3    1
        lda $fdfd,X                                     ; 4    3
        pha                                             ; 3    1
        lda $fdfc,X                                     ; 4    3
        pha                                             ; 3    1
        lda $fdfb,X                                     ; 4    3
        pha                                             ; 3    1
        tsx                                             ; 2    1
        bmi Cache_the_Kernel                            ; 3/2  2
    ; Runs as far down as zp$7c
    ; Restore the stack pointer
        tya                                             ; 2    1
        tax                                             ; 2    1
        txs                                             ; 2    1
    ; 10 + n x 8 + 39 + 6
    ; around 8.6 clocks per byte
    ; around 8.8 bytes per scanline
    ; 32 bytes of ROM space

Am I creating page boundary issues with my lda addressing the end of the previous page? I could just change the ROM position by -5 to be able to raise the address starting points. Any other suggestions? Thanks for all the tips so far.

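On the page boundary question: on a 6502, `lda absolute,X` takes 4 cycles and one more whenever adding X carries out of the low byte of the base address. A Python sketch that tallies that penalty, assuming the operands in the listing are the literal addresses and the kernel occupies $80-$f2 so X starts each pass at $f2, $ed, and so on down to $84. The raised set of bases is a hypothetical alternative layout, not something from the thread.

```python
def lda_abs_x_cycles(base, x):
    """LDA absolute,X: 4 cycles, plus 1 if adding X carries out of the low byte."""
    return 5 if (base & 0xFF) + x > 0xFF else 4


def page_cross_penalty(bases, x_values):
    """Extra cycles across all loads compared with the no-crossing case."""
    return sum(lda_abs_x_cycles(b, x) - 4 for x in x_values for b in bases)


X_VALUES = range(0xF2, 0x7F, -5)                        # assumed X at the top of each pass
LISTING = (0xFDFF, 0xFDFE, 0xFDFD, 0xFDFC, 0xFDFB)      # operands from the listing
RAISED = tuple(base + 5 for base in LISTING)            # hypothetical: $fe04 down to $fe00

print(page_cross_penalty(LISTING, X_VALUES))
print(page_cross_penalty(RAISED, X_VALUES))
```

Under these assumptions every load in the listing crosses a page, costing one extra cycle per byte, while bases that start at or just above a page boundary avoid the penalty as long as the low byte plus the largest X stays at or below $ff.
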
glurk Posted August 9, 2023

Really negligible, but tsx, stx TEMP / ldx TEMP, txs would be a tiny bit quicker if you have a TEMP zp available.

+bent_pin Posted August 9, 2023

35 minutes ago, glurk said:
> Really negligible, but tsx, stx TEMP / ldx TEMP, txs would be a tiny bit quicker if you have a TEMP zp available.

I'm not sure that I get it. Could you please elaborate a bit?

Edit: I see, it's for preserving the stack pointer. Looks like it saves 2 clocks. Every bit helps.

Thomas Jentzsch Posted August 10, 2023

If the stackpointer value is always the same, you can skip saving it.

+bent_pin Posted August 10, 2023

4 hours ago, Thomas Jentzsch said:
> If the stackpointer value is always the same, you can skip saving it.

Good point, I'll keep that in mind when using this. I moved the kernel to the beginning of zp. I need to consider the game as a whole to decide how I will handle the remaining variables. I may ditch the stack for everything other than subroutines.