
About This Club

Atari VCS/2600 development using the Harmony/Melody board. See the GENERAL discussion area for setting up your environment. See the CDFJ discussion area for the tutorial.
  2. Just reading up on CDFJ+ to see how to get up to speed, and came to this page as part of the documentation. I just thought I'd add that for division by known constant, I pretty much let the compiler do all the work for me. I usually allocate 16 bits for fraction and do this... divide_by_5 = (value * (0x10000 / 5)) >> 16; divide_by_17 = (value * (0x10000 / 17)) >> 16; No need for tables or looking up shifts/values.
  3. Interestingly, Dionoid's size-optimized code is not only smaller, but also performs slightly better than the original. The original has some register-saving overhead; for some reason it needs one register more than the other simple loop memsets. When I played around with Thumb16 code, I found that incrementing the pointer instead of indexing often leads to slightly better loop results, at least with the compiler flags and options used. Here, the optimizer found and used a special ARM instruction (STMIA). This is why the alternate size-optimized single-loop code presented is two bytes smaller still (12 bytes) and also slightly faster.
  4. The gist: myMemsetInt_br8 always runs faster, right from count 1, and its average speed quickly increases with the size to fill, until it reaches almost 7 times (6.85) that of the original loop and still nearly 6 times (5.7) that of the fastest single loop in this comparison. To be exact: setting one integer (count 1) takes 19 cycles (including the call overhead for non-inlined code), roughly the same time as most variants. The original is slower at the beginning, as it has some initialization instructions; it needs 32 cycles. At count 8 a little setback takes place, reducing the average speed; however, the power of the block loop kicks in now and makes up for it very quickly (see data series and chart). Important here: we are never slower than any of the simple loops, even in this 'fast block preparation phase'! Some numbers: filling 7 integers takes less than half the time of the fastest single loop, 17 integers a third, 40 a quarter, 120 a fifth, and filling/clearing 1000 integers takes only about a sixth (speedup factor 5.6) of the time, with a lean 1.8 cycles per filled integer. So much for the theory - now let's see it in practice.
  5. Got some questions about how much is actually gained by using the presented methods. As I had no detailed answer to that, only some promising observations in my game effort, here are some calculations, based on a data sheet with cycle timings (from ARM, also attached). This might not completely replace actual measurements on real hardware, but it is probably not too far off to shed some light on it. The attached PDF analyzes the cycle counts for four relevant variants discussed in the topic and compares them, also graphically. It was done in Excel and I hope the cycle calculations are correct - you might do a countercheck and tell me where I miscounted... (In case you are wondering: the cycle count shown is calculated for a non-inlined function, so it includes the cycles for the method call and its return.) In case the PDF doesn't speak for itself and needs some explanation, please let me know as well. myMemsetInt_CycleComparison_4Variants.pdf
  6. Here is the corresponding optimized custom memsetInt in native assembler code. The same principle applies as with the assembler memcpy in my last post: the memory is set blockwise, 8 integers (i.e. 32-bit words) per loop pass, using well-suited special ARM instructions (thanks, Thomas), up to the point where the remaining number of integers/words is below the block size. These are then handled without another loop. Once again the aim was a reasonably low overhead for small(ish) sizes of memory to set. Nevertheless, when filling/clearing larger buffers, the method still performs well. Its size is ~48 bytes of ARM Thumb code. Remember: the destination pointer needs to be 4-aligned.

    void myMemsetInt_br8(unsigned int* destination, unsigned int fill, int count)
    {
        // (C) 2023 oliver gross aka bithopper - use at your own RISC
        // V0.9.230219_0
        asm volatile(
            "cmp r2, #8\n\t"
            "blt .LL02\n\t"
            "push {r4-r6}\n\t"
            "mov r4, r1\n\t"
            "mov r5, r1\n\t"
            "mov r6, r1\n\t"
            "sub r2, #8\n\t"
            ".LL01:\n\t"
            "stmia r0!, {r1, r4-r6}\n\t"
            "stmia r0!, {r1, r4-r6}\n\t"
            "sub r2, #8\n\t"
            "bpl .LL01\n\t"
            "add r2, #8\n\t"
            "pop {r4-r6}\n\t"
            ".LL02:\n\t"
            "lsl r2, r2, #1\n\t"
            "mov r3, #13\n\t"
            "sub r3, r2\n\t"
            "add pc, r3\n\t"
            "stmia r0!, {r1}\n\t"
            "stmia r0!, {r1}\n\t"
            "stmia r0!, {r1}\n\t"
            "stmia r0!, {r1}\n\t"
            "stmia r0!, {r1}\n\t"
            "stmia r0!, {r1}\n\t"
            "stmia r0!, {r1}\n\t"
            : : : "r3", "cc", "memory"
        );
    }

The code seems to run fine in my WIP, compiled with GCC, so I decided to post it here. However: this is a beta version! You are welcome to use it, but be warned: it is not thoroughly tested and there might be bugfixes and other updates in the near future. Please let me know of any issues or suggestions, feedback is most welcome 🙂
  7. Spent another couple of hours at the byte-wise memcpy, as the result of the C compiler optimization still left something to be desired. After several frustrating attempts I took a deep breath and decided to write a custom native Thumb16 memcpy. Maybe it is also of use for you, fellow hero (or heroine?!) of the VCS. I reduced the unrolling to 8, now featuring two phases: a faster 'block' copy (although still with single-value copy instructions - the multiple-register load/store instructions appear to work only with whole words), followed by 0 to 7 final single-byte copies. As the overall size is ~108 bytes, this implementation will not be suited for every VCS project (most certainly not for Dionoid's Lode Runner). It is still optimized for small copy sizes, though; I aimed for a low overhead when copying just a few bytes. When copying more than a few dozen bytes, we could use another memcpy implementation that aligns source and destination pointers and then ends up in a super-fast integer copy. Those algorithms are way faster for big copy sizes; however, they tend to be even bulkier, with a sobering overhead ratio for the first few bytes (how big are your sprites, again?)

    void myMemcpy_br8(unsigned char* destination, unsigned char* source, int count)
    {
        // (C) 2023 oliver gross aka bithopper - use at your own RISC
        // V0.9.230215_1
        asm volatile(
            "cmp r2, #7\n\t"
            "ble .LL2\n\t"
            "sub r2, #8\n\t"
            ".LL3:\n\t"
            "ldrb r3, [r1]\n\t"
            "strb r3, [r0]\n\t"
            "ldrb r3, [r1, #1]\n\t"
            "strb r3, [r0, #1]\n\t"
            "ldrb r3, [r1, #2]\n\t"
            "strb r3, [r0, #2]\n\t"
            "ldrb r3, [r1, #3]\n\t"
            "strb r3, [r0, #3]\n\t"
            "ldrb r3, [r1, #4]\n\t"
            "strb r3, [r0, #4]\n\t"
            "ldrb r3, [r1, #5]\n\t"
            "strb r3, [r0, #5]\n\t"
            "ldrb r3, [r1, #6]\n\t"
            "strb r3, [r0, #6]\n\t"
            "ldrb r3, [r1, #7]\n\t"
            "strb r3, [r0, #7]\n\t"
            "add r0, #8\n\t"
            "add r1, #8\n\t"
            "sub r2, #8\n\t"
            "bpl .LL3\n\t"
            "add r2, #8\n\t"
            ".LL2:\n\t"
            "lsl r2, r2, #3\n\t"
            "mov r3, #55\n\t"
            "sub r3, r2\n\t"
            "add pc, r3\n\t"
            "ldrb r3, [r1]\n\t"
            "add r1, #1\n\t"
            "strb r3, [r0]\n\t"
            "add r0, #1\n\t"
            "ldrb r3, [r1]\n\t"
            "add r1, #1\n\t"
            "strb r3, [r0]\n\t"
            "add r0, #1\n\t"
            "ldrb r3, [r1]\n\t"
            "add r1, #1\n\t"
            "strb r3, [r0]\n\t"
            "add r0, #1\n\t"
            "ldrb r3, [r1]\n\t"
            "add r1, #1\n\t"
            "strb r3, [r0]\n\t"
            "add r0, #1\n\t"
            "ldrb r3, [r1]\n\t"
            "add r1, #1\n\t"
            "strb r3, [r0]\n\t"
            "add r0, #1\n\t"
            "ldrb r3, [r1]\n\t"
            "add r1, #1\n\t"
            "strb r3, [r0]\n\t"
            "add r0, #1\n\t"
            "ldrb r3, [r1]\n\t"
            "add r1, #1\n\t"
            "strb r3, [r0]\n\t"
            "add r0, #1\n\t"
            : : : "r3", "cc", "memory"
        );
    }

A word of caution: although it performed nicely in my WIP, using GCC, it is not thoroughly tested, so please regard this as highly experimental code that might still see some updates in the near future. (Did I mention it is now highly platform dependent?) If someone uses this method and finds a bug, or has some other suggestions about the ARM code, please let me know!
  8. Forgot to update the Tips and Tricks topic with this, it's now in there.
  9. Thanks for sharing this! I've updated the Tips and Tricks topic.
  10. Related to this, for Lode Runner 2600 my biggest challenge was to bring down the number of bytes for the ARM binary, so preferring size over speed. I also experimented with the functions in defines_cdfj.h and found that the following C code for myMemsetInt results in 8 bytes less for the generated ARM binary:

    void myMemsetInt(unsigned int *destination, int fill, int count)
    {
        while (count > 0)
            destination[--count] = fill;
    }

...resulting in the ARM assembly code below. For some reason it looks more verbose than the original ARM assembly code for myMemsetInt, but it should use 8 bytes less ARM binary:

    myMemsetInt:
    .L1:
        cmp r2, #0
        bgt .L2
        bx lr
    .L2:
        subs r2, r2, #1
        lsls r3, r2, #2
        str r1, [r0, r3]
        b .L1
  11. No idea, sorry. I suppose using inline Assembler might be the only option here.
  12. Thanks, Thomas! Loading/storing multiple registers sure is a good way to further speed things up. If I remember correctly, this was even possible on the 68000. Do you have any idea how to 'persuade' the compiler optimizer to make use of it? Up to now, the code is platform independent...
  13. A word about the algorithm and its C implementation: as mentioned, while the technique is well known to some, its implementation in C proved a bit of a challenge. I tried different approaches and rejected all but the one presented, for various reasons. Generally speaking, the ultimate goal was C code that compiles to perfect ARM code, meaning you could not find a better solution when writing directly in ARM assembler - or at least we would want to come reasonably close. This means we must write some C code that nudges the C compiler's optimizer in the right direction. The best results I got were with a non-standard C feature called 'labels as values' that is common in GCC and other compilers. This allows for computed jumps and is used in the presented methods. Other approaches in pure standard C, utilizing the 'switch' statement as in Duff's Device (which I found afterwards on Wikipedia), resulted in considerably worse compilation results, at least in my tests. It would be interesting to see if some ARM code, natively written in assembler, could still be faster somehow. Please let me know!
  14. While optimizing a few C methods in my current CDFJ game project, I realized just how often I use the standard myMemcpy and myMemset functions provided in the file 'defines_cdfj.h'. These are basic loops that work fine and have a small memory footprint. I was using them all over the place, for initialization routines, sprite graphics, etc. For example:

    void myMemsetInt(unsigned int* destination, int fill, int count)
    {
        int i;
        for (i = 0; i < count; ++i) {
            destination[i] = fill;
        }
    }

The resulting ARM code the compiler produces looks something like:

    myMemsetInt:
        mov r3, #0
    .L2:
        cmp r3, r2
        bxge lr
        str r1, [r0, r3, lsl #2]
        add r3, r3, #1
        b .L2

Without knowing too much about ARM coding: as a rule of thumb we can assume each line is executed by the ARM processor in one cycle here. (Some commands will take longer - multiplications, for example.) So we see 5 lines running in a loop, where only one of these lines is actually setting a value in memory; the other 4 are loop related, which is a pretty hefty overhead, isn't it? Maybe we can do better? (Note: actually the compiler will produce Thumb16 code, which looks even worse.) All of a sudden I remembered how you sped up screen clearing in 68000 assembler on the Atari ST back in those days, and I realized: by unrolling the loop we get rid of much of this overhead! OK, let's do it. Let's assume we want to set 8 values somewhere in memory. We can get rid of the loop completely by writing something like this:

    void myMemsetInt_8(unsigned int* destination, int fill)
    {
        *destination++ = fill;
        *destination++ = fill;
        *destination++ = fill;
        *destination++ = fill;
        *destination++ = fill;
        *destination++ = fill;
        *destination++ = fill;
        *destination++ = fill;
    }

Which compiles to:

    myMemsetInt_8:
        str r1, [r0]
        str r1, [r0, #4]
        str r1, [r0, #8]
        str r1, [r0, #12]
        str r1, [r0, #16]
        str r1, [r0, #20]
        str r1, [r0, #24]
        str r1, [r0, #28]
        bx lr

That's better, right?
Just 9 instructions required for setting 8 integers in memory - as opposed to roughly 5 * 8 = 40 with the simple loop. OK, we did it. End of story. Err, wait. What we really need is a more general solution with a reasonable memory footprint, capable of setting arbitrary sizes of memory, just like the standard loop, but still getting good benefit from this principle of unrolling. After some thinking about how to do certain things in C, I ended up writing this piece of code:

    void myMemsetInt_unrolled16(unsigned int* destination, int fill, int count)
    {
        static const void* jmparray[] = {&&L_16,&&L_15,&&L_14,&&L_13,&&L_12,&&L_11,&&L_10,&&L_09,
                                         &&L_08,&&L_07,&&L_06,&&L_05,&&L_04,&&L_03,&&L_02,&&L_01};
        // NOTE: GCC -Os will hopefully optimize this code so that it computes the
        // address rather than actually building the table!
    loop:
        if (count < 16) goto *jmparray[count]; // Use unrolled fill for <count> integers.
              *destination++ = fill;
        L_01: *destination++ = fill;
        L_02: *destination++ = fill;
        L_03: *destination++ = fill;
        L_04: *destination++ = fill;
        L_05: *destination++ = fill;
        L_06: *destination++ = fill;
        L_07: *destination++ = fill;
        L_08: *destination++ = fill;
        L_09: *destination++ = fill;
        L_10: *destination++ = fill;
        L_11: *destination++ = fill;
        L_12: *destination++ = fill;
        L_13: *destination++ = fill;
        L_14: *destination++ = fill;
        L_15: *destination++ = fill;
        L_16: if (count <= 16) return; // DONE
        count -= 16;
        goto loop;
    }

The compiler produced this:

    myMemsetInt_unrolled16:
        ldr r3, .L22
    .L2:
        cmp r2, #15
        ldrle pc, [r3, r2, lsl #2] @ indirect memory jump
        str r1, [r0], #4
        str r1, [r0], #4
        str r1, [r0], #4
        str r1, [r0], #4
        str r1, [r0], #4
        str r1, [r0], #4
        str r1, [r0], #4
        str r1, [r0], #4
        str r1, [r0], #4
        str r1, [r0], #4
        str r1, [r0], #4
        str r1, [r0], #4
        str r1, [r0], #4
        str r1, [r0], #4
        str r1, [r0], #4
        str r1, [r0], #4
        cmp r2, #16
        bxle lr
        sub r2, r2, #16
        b .L2
    .L22:
        .word .LANCHOR0

Doesn't look too bad in regard to loop overhead, right?
You will notice the bigger size, though. Sure, the longer you unroll, the better the overhead ratio, but also the bigger the code size. I am very aware of the memory restrictions on the VCS, but when using bank-switching models like CDFJ with 32K, I guess the speed boost might well be worth the memory trade-off. If the unrolling seems too excessive, you can adapt it to any other number; maybe 8 is a good compromise between loop overhead and memory size. The 'unrolling factor' of 16 I used in this code just gives me the good feeling that I don't waste any loop overhead on my sprites, as most of them are not as tall :)) This idea is also applicable to other loops, for example the byte-wise myMemcpy:

    void myMemcpy_unrolled16(unsigned char* destination, unsigned char* source, int count)
    {
        static const void* jmparray[] = {&&L_16,&&L_15,&&L_14,&&L_13,&&L_12,&&L_11,&&L_10,&&L_09,
                                         &&L_08,&&L_07,&&L_06,&&L_05,&&L_04,&&L_03,&&L_02,&&L_01};
        // NOTE: GCC -Os will hopefully optimize this code so that it computes the
        // address rather than actually building the table!
    loop:
        if (count < 16) goto *jmparray[count]; // Use unrolled copy for <count> bytes.
        L_00: *destination++ = *source++;
        L_01: *destination++ = *source++;
        L_02: *destination++ = *source++;
        L_03: *destination++ = *source++;
        L_04: *destination++ = *source++;
        L_05: *destination++ = *source++;
        L_06: *destination++ = *source++;
        L_07: *destination++ = *source++;
        L_08: *destination++ = *source++;
        L_09: *destination++ = *source++;
        L_10: *destination++ = *source++;
        L_11: *destination++ = *source++;
        L_12: *destination++ = *source++;
        L_13: *destination++ = *source++;
        L_14: *destination++ = *source++;
        L_15: *destination++ = *source++;
        L_16: if (count <= 16) return; // DONE
        count -= 16;
        goto loop;
    }

As mentioned in the code comments: the jump tables are only provided to let the compiler's optimizer know about our intentions. If the optimizer does its job, these tables are not in the result.
The algorithm is optimized for small sizes, i.e. it is designed so that, compared to the simple loop, it has almost no overhead when setting just a single value. From the second integer on you get the sheer power of unrolling. When I showed the optimized copy and set methods to Darrell, he encouraged me to post them here. May someone find them useful as well - and now it's time to copy some more sprites. >> UPDATE: For some custom native ARM Thumb16 implementations of myMemcpy and myMemsetInt, please look further down <<
  15. I've tracked down the Stay Frosty 2 music driver I ported for John for Mappy. It's using an old version of the CDF driver, so I'll need to update it to CDFJ before posting it. In regards to this, the performance hit is less than 10%. Your custom ARM code is idled once per scanline for this sequence of 6507 opcodes:

    0-2 cycles waiting for the currently executing NOP to finish
    2 cycles for LDX #value
    3 cycles for STX AUDV0
    0 cycles for NOP - custom ARM code resumes after putting this NOP on the bus

That's 5-7 cycles of run time per scanline, which is 6.6% - 9.2%, or a 7.9% performance hit on average. There's also a slight amount of time spent calculating the audio value, but it's minuscule in comparison to the time it takes the 6507 to run those opcodes, so we could say about an 8% hit on average.
  16. CDFJ speech demo (sample playback) cdfj_speech.zip

When using Sample Mode the CDFJ driver supports playing back digital audio samples. Examples of this can be heard in Draconian. Sample playback only uses TIA's voice 0; voice 1 is available for your game's sound effects. If you're creative, you can still use voice 0 for sound effects whenever a sample is not playing. This can also be heard in Draconian.

In your 6507 code:
  - Set the mode to sample
  - Update AUDV0 once per scanline
  - Call your ARM routines with the Audio Interrupt turned on

In your ARM code:
  - Start sample
  - Test for end of sample

6507 code

Set the mode to sample. When updating SETMODE make sure to use the AUDIOSAMPLE constant:

    ldx #AUDIOSAMPLE|FASTON
    stx SETMODE

Update AUDV0 once per scanline. Read from AMPLITUDE and update AUDV0 once per scanline. For best audio quality do the update on the same cycle on every scanline, though this is not always possible.

    lda #AMPLITUDE
    sta AUDV0

NOTE: lda #AMPLITUDE requires SETMODE to have been set with the FASTON constant.

WARNING: Do NOT do this; you cannot "read ahead" from AMPLITUDE:

    lda #AMPLITUDE
    ... any other 6507 code
    sta AUDV0

Call your ARM routines with the Audio Interrupt turned on, so the updates to AUDV0 will still occur about once per scanline.

    ldx #$FE    ; Run ARM code with digital audio interrupt
    stx CALLFN  ; Call Function

The $FE tells CALLFN to trigger an ARM interrupt about once per scanline. The interrupt routine will extract the next sample value, then output this sequence to the 6507's bus:

    LDX #value
    STX AUDV0
    NOP

And yes, this means the value in the 6507's X register will no longer be $FE after stx CALLFN has finished executing.

NOTE: Using the audio interrupt will cause about a 10% hit in performance for your C code.

ARM code

Start sample. To start a sample, reset the sample to the beginning, set the playback rate, and set the pointer to the sample's position.
    resetWave(0);
    setNote(0, rate);
    setNote(1, 0);
    setNote(2, 0);
    setSamplePtr(sample);

Stop playback at end of sample. Retrieve getWavePtr(0) to get the current playback position of the sample. Once the end has been reached, stop playback (or start a new sample).

    if ((getWavePtr(0) >> 21) >= sample_size) {
        // sample has been fully played, stop playback
        setNote(0, 0);
        resetWave(0);
    }

Sample info: samples are 4-bit values, packed 2 per byte. The upper nybble of the byte is played back before the lower nybble. The max size of the packed sample data is 2K - which is 4096 sample values when the bytes are split into nybbles for playback. This C utility will convert a raw 8-bit audio file to the packed format. A raw file means there's no header information in the file, just digital samples. raw_to_dpc.c I used SOX to create raw 8-bit audio files.
  17. Actually @cd-w's song. He also used it when experimenting with the DPC's music abilities back in 2010. I also did a DPC experiment around the same time frame, though mine was with graphics. Our experiments with the Harmony's support for DPC led to the creation of DPC+, and eventually CDFJ. Fun!
  18. Couldn't resist playing around with your danube song ... some experiment with dynamic wave forms for the lead voice. cdf1_music.bin
  19. Thanks a million, Darrell, for this perfect starter pack. I feel obliged now to integrate it into my WIP. (...ok, let's begin with finding some extra cycles at a steady scanline position for each kernel line 😱)
  20. CDFJ music demo cdfj_music.zip

When using Music Mode the CDFJ driver supports playing back 3-voice music similar to the music in Pitfall 2. Unlike Pitfall 2's square waves only, CDFJ supports custom waveforms for a wider variety of sounds. Examples of this can be heard in Stay Frosty 2 and Mappy. Note: while SF2 uses the DPC+ driver, the ARM-enhanced audio is basically the same as when using the CDFJ driver. The music only uses TIA's voice 0; voice 1 is available for your game's sound effects.

In your 6507 code:
  - Set the mode to music
  - Update AUDV0 once per scanline
  - Call your ARM routines with the Audio Interrupt turned on

In your ARM code:
  - Copy the waveforms from ROM into Display Data RAM
  - Call setWaveform() for each voice
  - Call setNote() for each voice

6507 code

Set the mode to music. When updating SETMODE make sure to use the AUDIOMUSIC constant:

    ldx #AUDIOMUSIC|FASTON
    stx SETMODE

Update AUDV0 once per scanline. Read from AMPLITUDE and update AUDV0 once per scanline. For best audio quality do the update on the same cycle on every scanline, though this is not always possible.

    lda #AMPLITUDE
    sta AUDV0

NOTE: lda #AMPLITUDE requires SETMODE to have been set with the FASTON constant.

WARNING: Do NOT do this; you cannot "read ahead" from AMPLITUDE:

    lda #AMPLITUDE
    ... any other 6507 code
    sta AUDV0

Call your ARM routines with the Audio Interrupt turned on, so the updates to AUDV0 will still occur about once per scanline.

    ldx #$FE    ; Run ARM code with digital audio interrupt
    stx CALLFN  ; Call Function

The $FE tells CALLFN to trigger an ARM interrupt about once per scanline. The interrupt routine will calculate the current audio value, then output this sequence to the 6507's bus:

    LDX #value
    STX AUDV0
    NOP

And yes, this means the value in the 6507's X register will no longer be $FE after stx CALLFN has finished executing.

NOTE: Using the audio interrupt will cause about a 10% hit in performance for your C code.
ARM code

Copy the waveforms from ROM into Display Data RAM. For CDFJ driver performance reasons the waveforms must be located in RAM. The waveforms also need to be aligned on 32-byte boundaries. Search for "Waveform Tables" in cdf1_music.asm for more info and example waveforms.

Call setWaveform() for each voice. The C function setWaveform(voice, waveform), located in defines_cdfj.h, is used to set the waveform buffer used by each voice.
  - voice = 0 - 2
  - waveform = which 32-byte waveform buffer; basically the address offset into Display Data / 32.

Call setNote() for each voice. The C function setNote(voice, frequency), located in defines_cdfj.h, is used to set the frequency of each voice.
  - voice = 0 - 2
  - frequency = frequency of the note. Use getPitch(note), located in defines_cdfj.h, to calculate the frequency.
  - note = 1-88 where 1 = A0 and 88 = C7

While the waveforms default to 32 bytes in size, you can change that using setWaveSize(voice, size), also located in defines_cdfj.h. I don't believe anybody has used this feature yet.
  21. Would be interested in the demos, as well. And a music driver would be just fantastic!
  22. TY. I have ADSR in my own effort, but it's not using streams; just 60Hz manipulation of AUDV from hardwired ADSR envelopes, using preset AUDC/AUDF. First time I've ever done anything music related, so I'm on a steep learning curve. It does a surprisingly "good" job of simulating different instruments, though, rudimentary as it is. I'll PM you the latest to have a listen to. I'm somewhat bad at audio stuff, as my ears are nowhere near even slightly capable of hearing this stuff properly.
  23. Hmm, apparently I'd neglected to "follow" the Harmony/Melody club, so was unaware of your request until today. That's been rectified. We have a simple music, and a simple speech demo, used for testing the CDFJ driver as we made changes to it. I'll track down the most current versions and post them later today. I also ported the music driver from Stay Frosty 2 to CDFJ for John to use in Mappy. It does on-the-fly manipulation of the waveforms to generate ADSR. I'll track that down and post it as well.
  24. Can someone point me towards some documentation on CDFJ capability (and examples...?) for CDFJ audio streaming. Particularly the "3 channels" that I've seen alluded to. I have some decent "normal" music playing in my game (envelope-based) but would like to see how CDFJ/streaming could improve this.
  25. Status update from @batari: I'm heading to Austin for Tesla Con Texas later this week, so even if this gets figured out soon I won't be able to do anything with it until next week.