ilmenit, posted April 23, 2020:

Hi, I wrote a "small" article about optimizing C code for the CC65 compiler. Let me know if you have any feedback or additional questions. Enjoy reading! https://github.com/ilmenit/CC65-Advanced-Optimizations
zbyti, posted April 23, 2020:

Great stuff! Exceptional, how long and wide the Internet is! :]
danwinslow, posted April 23, 2020:

Very nice stuff, Ilmenit. I have learned a lot from you over the years. In particular, I wrote an (Atari ST) mouse handler for the A8 completely in CC65, using your techniques. It was an interrupt handler running off one of the POKEY timers, and it worked great.
drludos, posted April 23, 2020:

Wow, this is incredibly helpful, thanks a lot for taking the time to write it down! I'm not programming for the Atari 5200 (yet?), but I use CC65 for making Lynx and NES games, and this will be really helpful to increase the speed of my games! (The struct-of-arrays vs array-of-structs case, for example: I would never have thought of that!)
dmsc, posted April 24, 2020:

Hi!

15 hours ago, ilmenit said:
> Hi, I wrote a "small" article about optimizing C code for CC65 compiler. Let me know if you have any feedback or additional questions. Enjoy reading! https://github.com/ilmenit/CC65-Advanced-Optimizations

Many thanks! Now I can link to this document each time someone asks related questions.

I think you should reconsider the advice about function parameters. Sometimes allowing one parameter to a function actually produces faster and smaller code:
- at each call site, the compiler can just emit "LDA / LDX" to load the parameter, instead of the LDA/STA pair needed when the parameter is passed in static variables,
- but the code generator insists on pushing the received value to the stack at function entry. Even so, for functions called a lot this is a win.

IMHO, one of the most important optimizations the compiler could make is to avoid using the stack in leaf functions. There are two uses of the stack that could be avoided:
- using the stack to store the arguments passed in A/X;
- using the stack to save the value of ZP "registers", as those are call-saved.

Both could be replaced by allocating a small static area per function and storing the values there.

Have Fun!
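The trade-off described above can be sketched in plain C. This is only an illustration under assumptions: the names `g_arg`, `scale_global`, `scale_param`, `demo_global`, and `demo_param` are hypothetical and do not come from the guide, and the real cost of each style depends on the cc65 version and options, so the generated assembly should be inspected before choosing one.

```c
/* Two ways to hand one value to a small function (all names hypothetical). */

/* Style 1: argument passed through a static variable.
   Each call site needs a load and a store to g_arg before the call. */
static unsigned char g_arg;

static unsigned char scale_global(void)
{
    return (unsigned char)(g_arg << 1);
}

/* Style 2: a single real parameter. With cc65's fastcall convention the
   argument arrives in the A/X registers, so a call site only has to load
   them; the function itself may still push the value to the C stack on entry. */
static unsigned char scale_param(unsigned char value)
{
    return (unsigned char)(value << 1);
}

/* Both styles compute the same result; only the calling overhead differs. */
unsigned char demo_global(unsigned char v)
{
    g_arg = v;                 /* LDA/STA style hand-over */
    return scale_global();
}

unsigned char demo_param(unsigned char v)
{
    return scale_param(v);     /* register style hand-over */
}
```

If the function is called from many places, the per-call-site saving of style 2 is repeated at every call, which is why it can make the program shorter even when a single call is not faster.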
ebiguy, posted April 24, 2020:

Excellent! Bravo!!! Thank you very much. I like CC65 very much, and your article extends what can be achieved with it.
ilmenit (author), posted April 24, 2020:

8 hours ago, dmsc said:
> I think you should consider about the parameters to functions - sometimes allowing one parameter to a function actually produces faster and smaller code:
> - at each call site, the compiler can just emit "LDA / LDX" to load the parameter, instead of the LDA/STA when the parameter is passed in static variables,
> - but the code generator insists on pushing the received value to the stack at function init. Even so, for functions called a lot this is a win.

CC65 supports the "fastcall" calling convention, which passes data through the A/X registers: https://github.com/cc65/wiki/wiki/Parameter-passing-and-calling-conventions

However, when I was testing it, it always generated "jsr pushax" at the beginning (as you wrote), which negates the benefit of passing through registers, and when the code was benchmarked there was no measurable benefit. Do you have an example of a function where it brings a boost? I could add such a section to the guide.
dmsc, posted April 25, 2020:

Hi!

On 4/24/2020 at 5:43 AM, ilmenit said:
> CC65 allows "fastcall" calling convention that is passing data through A/X registers: https://github.com/cc65/wiki/wiki/Parameter-passing-and-calling-conventions
> However when I was testing it, it always generated "jsr pushax" at the beginning (as you wrote) which negates benefit of passing through registers and when the code was benchmarked there was no measurable benefit. Do you have some example of function where it brings boost? I could add such section to the guide.

It would normally be slower, but if you call the function a lot, the program will be shorter.

There is a trick you can use to save the function argument automatically, but it needs the called function to be in a separate C file from the caller:

    // This is in the "fun.c" file:

    // Defines the function as "void", but uses ASM to move the value in A
    // into the local variable
    unsigned char fun(void)
    {
        static unsigned char x;
        __asm__ ("sta %v", x);
        // Your function here
        return 0xFF^x;
    }

    // This also works with integer arguments, moving A and X into the
    // local variable
    unsigned fun16(void)
    {
        static unsigned x;
        __asm__ ("sta %v", x);
        __asm__ ("stx %v+1", x);
        // Your function here
        return 0xFF^x;
    }

Now, in a separate file, you declare the functions with the arguments:

    // This is in the "main.c" file:

    // Declare the functions with arguments
    unsigned char fun(unsigned char x);
    unsigned fun16(unsigned x);

    int main()
    {
        // Call the functions!
        return fun(12) + fun16(7);
    }

In this case, this is the code produced for the "fun.c" file:

    ; ---------------------------------------------------------------
    ; unsigned char __near__ fun (void)
    .segment "CODE"
    .proc _fun: near
    .segment "BSS"
    L0002:
        .res 1,$00
    .segment "CODE"
        sta L0002
        eor #$FF
        ldx #$00
        rts
    .endproc

    ; ---------------------------------------------------------------
    ; unsigned int __near__ fun16 (void)
    .segment "CODE"
    .proc _fun16: near
    .segment "BSS"
    L0007:
        .res 2,$00
    .segment "CODE"
        sta L0007
        stx L0007+1
        eor #$FF
        rts
    .endproc

As you can see, CC65 even knows that the values are already in X and A, so it does not need to reload them.

Have Fun!
Yaron Nir, posted April 28, 2020:

On 4/23/2020 at 12:28 PM, ilmenit said:
> Hi, I wrote a "small" article about optimizing C code for CC65 compiler. Let me know if you have any feedback or additional questions. Enjoy reading! https://github.com/ilmenit/CC65-Advanced-Optimizations

@ilmenit, you did excellent work, and this could become a reference for everyone who wishes to program well in CC65. Job well done!
devwebcl, posted April 29, 2020:

Great job!
clth, posted May 7, 2020:

Thanks a lot, this looks really handy. It will definitely save a lot of time compared to reinventing the wheel.
clth, posted January 21, 2021 (edited):

There is a fairly recent cc65 feature which I believe is worth adding to your optimization guide. For some reason, variables placed in ZP were accessed with a two-byte address instead of just one byte: https://github.com/cc65/cc65/issues/917

The custom fix mentioned in https://github.com/cc65/cc65/issues/917#issuecomment-647326244 was merged to the master branch around the end of the year (you need fresh sources, not the release archive). With this cc65 build I was able to gain about 700 bytes of RAM in my project just by placing global variables in ZP, and there is a noticeable performance boost too.
ilmenit (author), posted January 21, 2021 (edited):

Hi. Thanks for pointing this out! I will need to check how well it works. I was already proposing to use ZPSYM to make sure that single-byte addressing is used.
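As a rough illustration of the zero-page placement discussed above, the sketch below uses cc65's `bss-name` and `zpsym` pragmas. It rests on assumptions: the variable and function names (`zp_counter`, `zp_index`, `sum_bytes`) are hypothetical, the linker configuration must actually provide a "ZEROPAGE" segment with free space, and whether `zpsym` is still needed depends on the cc65 build, as noted in the posts above. The pragmas are guarded with `__CC65__` so that other compilers simply skip them.

```c
/* Sketch: asking cc65 to place frequently used globals in the zero page. */
#ifdef __CC65__
#pragma bss-name (push, "ZEROPAGE")
#endif

unsigned char zp_counter;   /* hypothetical hot variables */
unsigned char zp_index;

#ifdef __CC65__
#pragma bss-name (pop)
/* Tell the compiler these symbols live in the zero page, so that
   single-byte addressing is used when they are accessed. */
#pragma zpsym ("zp_counter")
#pragma zpsym ("zp_index")
#endif

/* Sum up to 255 bytes, keeping loop state in the zero-page variables. */
unsigned char sum_bytes(const unsigned char *data, unsigned char count)
{
    zp_counter = 0;
    for (zp_index = 0; zp_index < count; ++zp_index) {
        zp_counter += data[zp_index];
    }
    return zp_counter;
}
```

Checking the generated assembly for one-byte operands on the accesses to `zp_counter` is the simplest way to confirm that the placement had the intended effect.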
thank you, posted January 26, 2021 (edited):

I couldn't find 'benchmarks.h'. Is my implementation wrong? I'm getting 533 ticks instead of 528 at Step 01 of your guide, @ilmenit. This is the current 'git clone' version. I added this to your example:

    typedef unsigned int word;
    word ticks;

    void start_benchmark(void)
    {
        ticks = PEEKW(18);
    }

    void end_benchmark(void)
    {
        printf("Ticks used: %d\n", PEEKW(18) - ticks + PEEK(20));
    }

Sorry if this is an obvious or beginner error.

Edit: that is a 'git clone' of cc65, not your repo, and I should have said that I also do "#include <peekpoke.h>".
ilmenit (author), posted January 26, 2021:

4 hours ago, thank you said:
> I couldn't find 'benchmarks.h'. Is my implementation wrong? I'm getting 533 ticks instead of 528 at Step 01 of your guide, @ilmenit This is 'git clone' current version. i added to your example--
>
>     typedef unsigned int word;
>     word ticks;
>     void start_benchmark(void) { ticks = PEEKW(18); }
>     void end_benchmark(void) { printf("Ticks used: %d\n", PEEKW(18) - ticks + PEEK(20)); }
>
> sorry if obvious or beginner error. edit: 'git clone' of cc65, not your repo... and should have said I also do "#include <peekpoke.h>".

The "benchmark.h" file is in my repo, e.g. https://github.com/ilmenit/CC65-Advanced-Optimizations/blob/master/03-smallest%20unsigned%20data%20types/benchmark.h

533 vs 528 is a very small difference. While it may depend on the version of the compiler or the selected compilation options, remember that the timer at memory location 20 has a value of 0-255. You are not zeroing it in your code, so the final result may differ by +0 to +255 depending on the moment you run the code.
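One way to avoid depending on the timer's starting value is to take a full snapshot at the start and at the end and subtract the two. The sketch below assumes the Atari RTCLOK layout, where locations 18, 19, and 20 form a 24-bit frame counter with 18 as the most significant byte and 20 as the least significant; the helper names `rtclok_to_ticks` and `ticks_elapsed` are hypothetical and are not the contents of the repo's benchmark.h.

```c
/* Sketch: elapsed frame ticks from two snapshots of the Atari RTCLOK
   counter (locations 18, 19, 20; location 18 is the most significant byte). */

/* Combine the three RTCLOK bytes into one 24-bit tick count. */
unsigned long rtclok_to_ticks(unsigned char b18, unsigned char b19,
                              unsigned char b20)
{
    return ((unsigned long)b18 << 16) |
           ((unsigned long)b19 << 8)  |
            (unsigned long)b20;
}

/* Difference between two snapshots, correct across 24-bit wrap-around. */
unsigned long ticks_elapsed(unsigned long start, unsigned long end)
{
    return (end - start) & 0xFFFFFFUL;
}
```

On the target, each snapshot would be built from `PEEK(18)`, `PEEK(19)`, and `PEEK(20)` (from `<peekpoke.h>`), and the result printed with `%lu`. Note that the three reads are not atomic with respect to the vertical blank interrupt that updates the counter, so a snapshot taken exactly at a byte rollover can be off.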
sanny, posted January 26, 2021:

5 hours ago, thank you said:
> I couldn't find 'benchmarks.h'. Is my implementation wrong? I'm getting 533 ticks instead of 528 at Step 01 of your guide, @ilmenit This is 'git clone' current version. i added to your example--
>
>     typedef unsigned int word;
>     word ticks;
>     void start_benchmark(void) { ticks = PEEKW(18); }
>     void end_benchmark(void) { printf("Ticks used: %d\n", PEEKW(18) - ticks + PEEK(20)); }
>
> sorry if obvious or beginner error. edit: 'git clone' of cc65, not your repo... and should have said I also do "#include <peekpoke.h>".

Instead of using the PEEK() functions you could use clock(). Then it wouldn't look so BASIC-like.
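A minimal sketch of the clock()-based approach, using only standard C from `<time.h>`. The names `workload`, `measure_workload`, and `sink` are hypothetical stand-ins for the code under test; the tick resolution is whatever the target's `CLOCKS_PER_SEC` provides.

```c
#include <time.h>

/* Volatile so the compiler cannot optimize the loop away. */
static volatile unsigned char sink;

/* Hypothetical workload standing in for the code being benchmarked. */
static void workload(unsigned int iterations)
{
    while (iterations--) {
        ++sink;
    }
}

/* Returns the number of clock() ticks spent inside workload(). */
clock_t measure_workload(unsigned int iterations)
{
    clock_t start = clock();
    workload(iterations);
    return clock() - start;
}
```

The result can be printed portably with `printf("%lu\n", (unsigned long)ticks)`. Because the start value is subtracted, the measurement does not depend on what the timer held when the program started.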
thank you, posted January 27, 2021:

Thanks @ilmenit, I made it through the lesson. Very interesting results... I learned a lot. Somehow I missed the handy links at the top of the page to the various steps of the code, and I am as bad at searching on GitHub as I am on this forum.

@sanny I should probably RTFM, thanks.
Harry Potter, posted January 29, 2021:

Hi! I have my own cc65 code optimizations to share. They are at https://sourceforge.net/projects/cc65extra/files/. Thank you for your work, and I hope you appreciate mine.
clth, posted February 3, 2021:

Ahem, hardly an advanced one, but worth mentioning somewhere. When trying to improve the output of the cc65 random number generator, I started setting the seed on every loop from a somewhat more random value taken from the DLI.

Original code, a simple line copied from somewhere:

    srand((unsigned) time(NULL));

Updated variant:

    srand(dli_variable);

Using time() means an extra 1700+ bytes consumed. I use almost no library functions, but this one slipped through.
ilmenit (author), posted February 3, 2021:

Unless you need a "deterministic RNG" like the one provided by srand/rand, it is even shorter to use the POKEY RANDOM register.
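Reading the hardware register directly can be sketched as below. This assumes the Atari 8-bit computer memory map, where POKEY's RANDOM register is at $D20A; other POKEY machines map it elsewhere. The helper name `random_masked` and the macro `POKEY_RANDOM_ADDR` are hypothetical, and the register pointer is passed as a parameter only so the helper can be tried off-target with an ordinary variable.

```c
/* Address of the POKEY RANDOM register on Atari 8-bit computers (assumed). */
#define POKEY_RANDOM_ADDR 0xD20A

/* Read one byte from a hardware random register and keep only the bits
   selected by mask, e.g. mask 0x07 yields a value in the range 0..7. */
unsigned char random_masked(const volatile unsigned char *random_reg,
                            unsigned char mask)
{
    return (unsigned char)(*random_reg & mask);
}
```

On the target the call would look like `random_masked((const volatile unsigned char *)POKEY_RANDOM_ADDR, 0x07)`. No seeding and no library code are involved, but the sequence cannot be reproduced, which is the trade-off against srand/rand mentioned above.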
zbyti, posted April 28, 2021:

https://barrgroup.com/embedded-systems/how-to/efficient-c-code
ilmenit (author), posted April 28, 2021:

A comparison of different C compilers (cc65, vbcc, kickc, gcc + asm): https://www.videogamesage.com/topic/762-super-tilt-bro-for-nes/page/2/?tab=comments#comment-163145

I haven't read through it yet, just sharing for now.
zbyti, posted April 28, 2021:

Quote:
> If you didn't read "Advanced optimizations in CC65" by now (shame on you!), you don't know how the optimized code is a mess to read. All available tricks have been used, even the author does not recommend going so far in real life.
ivop, posted April 28, 2021:

2 hours ago, ilmenit said:
> Comparison of different C compilers (cc65, vbcc, kickc, gcc + asm): https://www.videogamesage.com/topic/762-super-tilt-bro-for-nes/page/2/?tab=comments#comment-163145 I didn't read through it yet, just sharing for now.

Nice to see the author of the 6502 gcc backend answered, and even fixed a couple of bugs! Perhaps I should resurrect the Atari 8-bit port again, and have it merged as soon as possible, so it'll track the latest gcc sources.
vbc, posted April 28, 2021:

10 hours ago, ilmenit said:
> Comparison of different C compilers (cc65, vbcc, kickc, gcc + asm): https://www.videogamesage.com/topic/762-super-tilt-bro-for-nes/page/2/?tab=comments#comment-163145 I didn't read through it yet, just sharing for now.

As this blog also got referenced in another forum, I will add my observations regarding this comparison here as well:

Someone pointed me to this comparison some time ago, because it seemed to mention a bug in my compiler (vbcc). Trying to verify this was somewhat tedious, because the author of the comparison uses his own simulator and, in the case of vbcc, his own linker scripts and configuration files.

After a short look I found that the test that did not work with vbcc uses the pages 0x300 and 0x400 to write the results into a simulated frame buffer or something like that. However, his vbcc linker file contains:

    MEMORY
    {
      ...
      ram: org=0x0300, len=0x0500
    }

    SECTIONS
    {
      ...
      data: {*(data)} >ram AT>out
      ...
      bss (NOLOAD): {*(bss)} >ram
      ...
    }

I did not investigate further, but putting the data and bss sections in the frame buffer does seem suspicious to me.

As I did not want to waste much time on the tinkered configs, I slightly adapted the code to the C64 screen buffer, and the result compiled with vbcc for the C64 looked very similar to the one compiled with cc65. Strangely, however, the player "sprite" only showed with cc65. Further investigation showed that the test code used an uninitialized variable for the y-coordinate of the player. After fixing this bug in the test, the result of vbcc exactly matched that of cc65.

When I tried to add a timer variable to measure run time on the C64, the code no longer compiled with cc65, because it exceeded the 256-byte limitation of cc65. Apparently the test was tailored exactly to cc65's limitations, whereas the vbcc result was basically sabotaged.

It is obviously not an unbiased comparison. Rather, the author started with code for cc65 and did no further investigation when the code did not work with a compiler he personally dislikes (whereas for gcc, which apparently generated actually broken code, he even went out of his way to fix the assembly code by hand). With this approach your daily-use compiler will of course tend to look more stable. That does not really say much.

As I wrote above, I am the author of the compiler that the author of this comparison hates, so obviously I am not unbiased either. When I checked his article, I did not write anything, and I did not really want to get involved (and write lengthy posts like this one). However, if people point to this blog, I have to say that after what I have checked so far, it is my firm (and, as my findings hopefully show, mostly fact-based) opinion that this comparison is much too flawed to base a compiler decision on.