Thomas Jentzsch Posted November 24, 2022

32 minutes ago, llabnip said: @SpiceWare Well... it's not going to set any land-speed-records... but it's something.

How many FPS?

+wavemotion Posted November 24, 2022

19 minutes ago, Thomas Jentzsch said: How many FPS?

It's on the upper left of the lower screen... 42 currently. Most other CDFJ games are similar. Working on optimization... if I can get it into the mid-to-upper-50s they will at least be playable. Kinda.

Thomas Jentzsch Posted November 24, 2022

24 minutes ago, llabnip said: Working on optimization...

My favorite pastime.

+wavemotion Posted November 24, 2022

21 minutes ago, Thomas Jentzsch said: My favorite pastime.

Mine too... though it can be frustrating until you find that next breakthrough!

Right now, CDFJ games play fine (albeit slowly). I have three older CDF games that are all marked as V1 (the version byte is 01) and of those only Super Cobra works. Draconian and Mappy both crash after the title screen. I'm handling the different offsets for the V1 fetchers the same as current Stella (otherwise all 3 games crash even before the splash screen). Not sure what I've got wrong... so optimization is on hold until I can get all games working.

I don't have any V0 CDF games to try.

Edited November 24, 2022 by llabnip

+SpiceWare Posted November 24, 2022

2 hours ago, llabnip said: Well... it's not going to set any land-speed-records... but it's something.

I thought it'd be a bit faster, but that is a promising start! I suspect the reason is that moving the datastream pointer initializations from 6507 to ARM means more ARM code can be run each frame under CDFJ.

One thing that might help is the new CDFJ+ framework will be clocked at 60 MHz instead of 70 MHz. The drop is because the ARMs with 64K Flash and up are rated for 60 MHz, so if you started with a 32K CDFJ+ project at 70 MHz then decided to increase to 64K you'd see a drop in performance. It will be possible for a 32K CDFJ+ game to clock itself at 70 MHz, though I don't know if we'll advertise that ability or not.

+SpiceWare Posted November 24, 2022

1 hour ago, llabnip said: I don't have any V0 CDF games to try.

I'm sure I have something somewhere. Getting ready to head over to my folks for Thanksgiving, so have left myself a note to look for them when I get back.

+wavemotion Posted November 24, 2022

As I await Turkey… I suspect it has to do with not advancing the music fetchers. I didn't port the music fetchers yet. Maybe the game is waiting for some auto advance in that area. Not near my computer but will toy more tonight.

+SpiceWare Posted November 24, 2022

Was able to track down a few examples before heading over to my folks. Contains an early build of Draconian and test ROMs that target specific features of CDF.

Attachment: CDFv0.zip

I believe the test ROMs have been built for each version of CDF, so I should be able to post those as well if you think they'd be useful.

Heading over to my folks now, so have a great Thanksgiving!

+wavemotion Posted November 24, 2022

Hope you had a great T-Day, @SpiceWare

With a full stomach and some debug, I believe the problem with my emulation and Draconian and Mappy is this code:

Those two games get into this logic in the Thumbulator... and those debug[x] are just some debug counters so I can see what's getting hit. I'm not sure I've got my address test correct here and maybe I've got these skewed... does this make sense:

debug[0] was hit 6 times (i.e. CDF1_SetNote called 6 times)
debug[1] was hit 2 times (i.e. CDF1_ResetWave called 2 times)
debug[2] was hit 40,000 times (GetWavePtr... over the course of several minutes)
debug[3] and debug[4] were never called.
debug[5] was called some 23,000 times (not even sure what this does... apparently nothing in my case)

Thoughts or clues of any kind might help uncover the mystery here!

Frantic, Boom, Galaga, Lode Runner, RobotWar, RubyQ, WoW, Zevious, Zoo Keeper and Super Cobra all seem to play fine. Will try your V0 tests soon.

Edited November 24, 2022 by llabnip

+wavemotion Posted November 25, 2022

Ok... finished with the first big round of optimization on CDF/CDFJ. Mostly around the data streams (Fast Fetchers and Fast Jumps) which brought a fairly large improvement in speed.

The "playables":
Lode Runner is full speed.
rubyQ is full speed.
Frantic is playable in the upper 50 fps (often hitting 60)
Qyx is playable in the upper 50 fps (often hitting 60)
WoW is playable in the upper 50 fps (hitting 60 when there are just a few enemies left)
Galaga is playable in the mid-to-lower 50 fps

The "nice to see but needs more speed":
Robotwar is around 52 fps
Zookeeper is around 50 fps
Super Cobra Arcade is around 50 fps
Lady Bug Arcade hits the emulator the hardest at 45 fps

The "unplayables":
Draconian (62 fps on title screen but crashes with the aforementioned music fetchers)
Mappy (crashes with the aforementioned music fetchers)

+SpiceWare Posted November 25, 2022

Thanksgiving was good, hope yours was too. My aunt and uncle were over from Bandera (just west of San Antonio) for Thanksgiving. They're now on a road trip to Branson with my folks, their first road trip in an EV - they seem to be a little nervous, even though my folks have made this trip numerous times in Sparky, their Model Y 🤷♂️. We decked out Sparky for the Holidays:

Months ago they asked me to watch their dog while they were on the trip. I've had her since last night and she's very distraught, so I didn't sleep very well last night. She's also having an issue with her toe nails falling out, resulting in her bleeding all over the house - looked like multiple miniature murder scenes throughout the house 😱. Appears to have stopped, for now. Hard floors cleaned up OK, now I need to see if I have anything to clean up the carpet, else it's off to the stores on Black Friday 😱.

The BUS and CDF drivers are written in 32-bit ARM mode, not Thumb mode. There are a few 32-bit audio related subroutines in the driver that we call from Thumb mode:

0 = NoteStore /* Update note values (r2 = note, r3 = freq) */
1 = ResetWaveStore /* Reset wave (r2 = wave) */
2 = WavePtrFetch /* Fetch waveform pointer (r2 = waveform, returned in r3) */
3 = WaveSizeStore /* Update waveform size (r2 = waveform, r3 = size) */

Thumbulator doesn't emulate 32-bit mode, so those get passed back to the cartridge class to handle. 32-bit is triggered by using an even address, while Thumb mode is triggered by an odd address.

The 0x0000083a has to do with the custom ARM code finishing and returning control to the 32-bit driver.

I don't recall what the last else block was for - the thumbCallback(255...) doesn't do anything in the BUS or CDF cartridge classes.

40,000 times for GetWavePointer is surprising - it's called twice per frame in Draconian (first thing in OverScan, last thing in VerticalBlank) to find out if the current sample has finished:

    static void CheckSampleFinished()
    {
        if (gCurrentSample)
        {
            if ((getWavePtr(0) >> 21) > SAMPLE_SIZES[gCurrentSample-1])
            {
                if (gCurrentSample == SAMPLE_ALERT_ALERT && gAlertRepeat)
                {
                    TriggerSample(SAMPLE_ALERT_ALERT);
                    gAlertRepeat = 0;
                }
                else
                {
                    EndSample();
                }
            }
        }
    }

Thomas Jentzsch Posted November 25, 2022

22 minutes ago, llabnip said: Ok... finished with the first big round of optimization on CDF/CDFJ. Mostly around the data streams (Fast Fetchers and Fast Jumps) which brought a fairly large improvement in speed.

Would it make sense to merge your optimizations back to Stella? Or are they too platform specific?

+wavemotion Posted November 25, 2022

4 hours ago, Thomas Jentzsch said: Would it make sense to merge your optimizations back to Stella? Or are they too platform specific?

Many are platform specific... some are hacks you wouldn't want in the mainline Stella. But some fast things that had a ton of speedup were:

Not calling into the execute() for each Thumb instruction - the overhead of the call was not optimized away with GCC at the max settings so I moved the handling of the Thumb loop to inside the execute().

Keeping a 16-bit pointer always pointing to the next instruction rather than re-indexing into the Thumb instruction ROM array. I don't even bother to update the PC register until it's needed (easy to back-calculate from the 16-bit Thumb instruction pointer from the start of ROM).

The biggest improvement in speed came from simply using the top 2 bits of the Thumb instruction to binary parse the instruction. So I check if the high bit is set - that puts the instruction into one of two buckets. Then check bit 15 to parse those two buckets into two further buckets. This way I only have to check those instructions in each bucket which really reduces the long search for the opcode. Then I did some profiling and found the popular instructions which were often several orders of magnitude more likely to be called - and check them first in each bucket (e.g. ADD big immediate one register, CMP immediate, conditional branch, etc.)

I'd say those 3 things got me almost 40% speed. But probably mainline Stella doesn't need it... processing power keeps going through the roof. I'm in a somewhat special situation in that the classic Nintendo DS/DSi isn't getting any faster. Why did I target that platform? It was the handheld collecting dust in my closet. And they made more than 150 million of them so they are everywhere and easy to get on the second hand market. Plus: constraints breed creativity.

Meanwhile...

Wizard of Wor Arcade (demo) at 58 fps
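
The bucket idea and the "instruction pointer instead of PC" idea from the post above can be sketched roughly as follows. This is only an illustration, not the code from the DS port or from Stella: the names `bucket_of`, `classify`, `address_of` and the enums are hypothetical, bits are counted from 0 (so the top two bits are 15 and 14), and only a handful of the hot instructions mentioned in the post are recognised.

```c
#include <stdint.h>

/* Hypothetical bucket names: one bucket per value of the top two bits. */
typedef enum { BUCKET_00, BUCKET_01, BUCKET_10, BUCKET_11 } Bucket;

/* Two tests narrow a 16-bit Thumb instruction to one of four buckets. */
static Bucket bucket_of(uint16_t inst)
{
    if (inst & 0x8000)
        return (inst & 0x4000) ? BUCKET_11 : BUCKET_10;
    return (inst & 0x4000) ? BUCKET_01 : BUCKET_00;
}

/* Hypothetical labels for a few frequently executed instructions. */
typedef enum { OP_CMP_IMM8, OP_ADD_IMM8, OP_MOV_IMM8, OP_BCOND, OP_OTHER } HotOp;

/* Inside each bucket, test the most frequently executed encodings first. */
static HotOp classify(uint16_t inst)
{
    switch (bucket_of(inst)) {
    case BUCKET_00:
        if ((inst & 0xF800) == 0x2800) return OP_CMP_IMM8; /* CMP Rd, #imm8 */
        if ((inst & 0xF800) == 0x3000) return OP_ADD_IMM8; /* ADD Rd, #imm8 */
        if ((inst & 0xF800) == 0x2000) return OP_MOV_IMM8; /* MOV Rd, #imm8 */
        return OP_OTHER;
    case BUCKET_11:
        /* 1101 cccc: conditional branch; cccc = 1110 is undefined, 1111 is SWI */
        if ((inst & 0xF000) == 0xD000 && (inst & 0x0F00) < 0x0E00)
            return OP_BCOND;
        return OP_OTHER;
    default:
        return OP_OTHER; /* remaining buckets are not split in this sketch */
    }
}

/* Recover the byte address of the next instruction from a running pointer,
   so no program-counter variable has to be updated on every step. */
static uint32_t address_of(const uint16_t *rom, const uint16_t *next,
                           uint32_t rom_base)
{
    return rom_base + (uint32_t)(next - rom) * 2u;
}
```

For example, `classify(0x2805)` (CMP r0, #5) is resolved after two bucket tests and one mask compare, while a rarely used instruction falls through to `OP_OTHER` and would be handled by a slower general path.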

+johnnywc Posted November 25, 2022

49 minutes ago, llabnip said: Meanwhile... Wizard of Wor Arcade (demo) at 58 fps

Awesome! 👍 I really need to dig up a DSi somewhere... 👀

+johnnywc Posted November 25, 2022

6 hours ago, llabnip said: Ok... finished with the first big round of optimization on CDF/CDFJ. Mostly around the data streams (Fast Fetchers and Fast Jumps) which brought a fairly large improvement in speed. The "playables": Qyx is playable in the upper 50 fps (often hitting 60) WoW is playable in the upper 50 fps (hitting 60 when there are just a few enemies left) Galaga is playable in the mid-to-lower 50 fps The "nice to see but needs more speed": Robotwar is around 52 fps Zookeeper is around 50 fps Super Cobra Arcade is around 50 fps

That's really good!

6 hours ago, llabnip said: Lady Bug Arcade hits the emulator the hardest at 45 fps

Hmm, I'm surprised this is hitting the emulator the hardest, especially since it's one of the simplest games, although I suspect it's because I update the entire playfield each frame to achieve the 'blending' effect of the orange/green to get white and I update the doors each frame. There is an option to disable the blending by flipping the right difficulty to 'A' (in this case it will just alternate pink/green lines) which may improve performance, although I suspect I was 'lazy' and didn't put in different code to *not* update the entire screen even in this mode to save ROM.

6 hours ago, llabnip said: The "unplayables": Draconian (62 fps on title screen but crashes with the aforementioned music fetchers) Mappy (crashes with the aforementioned music fetchers)

That'll be cool if you can get these two to play! 🤞

Thanks for all the hard work, great job! 👍

Thomas Jentzsch Posted November 26, 2022

8 hours ago, llabnip said: Many are platform specific... some are hacks you wouldn't want in the mainline Stella. But some fast things that had a ton of speedup were: Not calling into the execute() for each Thumb instruction - the overhead of the call was not optimized away with GCC at the max settings so I moved the handling of the Thumb loop to inside the execute(). Keeping a 16-bit pointer always pointing to the next instruction rather than re-index into the Thumb instruction ROM array. I don't even bother to update the PC register until it's needed (easy to back-calculate from the 16-bit Thumb instruction pointer from the start of ROM). The biggest improvement in speed came from simply using the top 2 bits of the Thumb instruction to binary parse the instruction. So I check if the high bit is set - that puts the instruction into one of two buckets. Then check bit 15 to parse those two buckets into two further buckets. This way I only have to check those instructions in each bucket which really reduces the long search for the opcode. Then I did some profiling and found the popular instructions which were often several orders of magnitude more likely to be called - and check them first in each bucket (i.e. ADD big immediate one register, CMP immediate, conditional branch, etc.) I'd say those 3 things got me almost 40% speed.

Thanks. These are very good tips. I will definitely make use of them.

8 hours ago, llabnip said: But probably mainline Stella doesn't need it... processing power keeps going through the roof.

True, but we are also supporting other platforms which are less powerful (e.g. RetroN 77). And these are often on the edge of 60 Hz too.

8 hours ago, llabnip said: I'm in a somewhat special situation in that the classic Nintendo DS/DSi isn't getting any faster. Why did I target that platform? It was the handheld collecting dust in my closet. And they made more than 150 Million of them so they are everywhere and easy to get on the second hand market. Plus: constraints breed creativity.

That's one of the reasons why I am coding for the 2600. 😀

Thomas Jentzsch Posted November 26, 2022

9 hours ago, llabnip said: The biggest improvement in speed came from simply using the top 2 bits of the Thumb instruction to binary parse the instruction. So I check if the high bit is set - that puts the instruction into one of two buckets. Then check bit 15 to parse those two buckets into two further buckets. This way I only have to check those instructions in each bucket which really reduces the long search for the opcode. Then I did some profiling and found the popular instructions which were often several orders of magnitude more likely to be called - and check them first in each bucket (i.e. ADD big immediate one register, CMP immediate, conditional branch, etc.)

Thinking further about this one, I am really surprised that this makes any difference. AFAIK, compilers create branch tables for such switches. Then the order makes no difference and the access time for any opcode is constant. And since we are using enums for the opcodes, this should be straightforward for the compiler.

I just checked the assembler created by Visual Studio with debug settings. And even there it uses a branch table:

    switch (decodedOp) {
    00000001404D915A movzx eax,byte ptr [decodedOp]
    00000001404D915F mov dword ptr [rsp+70h],eax
    00000001404D9163 mov eax,dword ptr [rsp+70h]
    00000001404D9167 dec eax
    00000001404D9169 mov dword ptr [rsp+70h],eax
    00000001404D916D cmp dword ptr [rsp+70h],48h ; I think here the debug code checks against the number of defined enums
    00000001404D9172 ja $LN333+5Bh (01404DDD95h)
    00000001404D9178 movsxd rax,dword ptr [rsp+70h]
    00000001404D917D lea rcx,[__ImageBase (0140000000h)]
    00000001404D9184 mov eax,dword ptr [rcx+rax*4+4DDE5Ch]
    00000001404D918B add rax,rcx
    00000001404D918E jmp rax ; this directly jumps to the opcode's code

Could this be a wrong compiler setting on your side?

Edited November 26, 2022 by Thomas Jentzsch

Thomas Jentzsch Posted November 26, 2022

@llabnip I only just noticed that you seem to have based your code on the original Thumbulator class. Which might explain some of your gains. In Stella, we achieved a major speed improvement by decoding the ROM only once (into decodeROM).

Edited November 26, 2022 by Thomas Jentzsch
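
A minimal sketch of the "decode the ROM only once" idea follows, assuming a tiny subset of Thumb (MOV/ADD/SUB/CMP with an 8-bit immediate) and ignoring condition flags and branches entirely. The names `predecode_one`, `predecode_rom`, `run_predecoded` and the `DOP_*` values are hypothetical and are not Stella's actual identifiers. The point is only that the bit-mask tests run once per ROM word up front, and the execution loop then switches on a small dense opcode byte.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pre-decoded opcode values (small and dense, so a compiler
   can turn the switch below into a jump table). */
enum { DOP_OTHER = 0, DOP_MOV_IMM8, DOP_CMP_IMM8, DOP_ADD_IMM8, DOP_SUB_IMM8 };

/* The mask-and-compare work happens here, once per ROM word. */
static uint8_t predecode_one(uint16_t inst)
{
    switch (inst & 0xF800) {
    case 0x2000: return DOP_MOV_IMM8; /* MOV Rd, #imm8 */
    case 0x2800: return DOP_CMP_IMM8; /* CMP Rd, #imm8 */
    case 0x3000: return DOP_ADD_IMM8; /* ADD Rd, #imm8 */
    case 0x3800: return DOP_SUB_IMM8; /* SUB Rd, #imm8 */
    default:     return DOP_OTHER;
    }
}

/* Fill a parallel table with one decoded byte per 16-bit instruction. */
static void predecode_rom(const uint16_t *rom, uint8_t *decoded, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        decoded[i] = predecode_one(rom[i]);
}

/* Straight-line execution of `count` instructions using the decoded table.
   Flags are not modelled, so CMP and everything else is skipped here. */
static void run_predecoded(const uint16_t *rom, const uint8_t *decoded,
                           size_t count, uint32_t regs[8])
{
    for (size_t i = 0; i < count; ++i) {
        uint16_t inst = rom[i];
        uint32_t rd = (inst >> 8) & 7u;  /* destination register r0..r7 */
        uint32_t imm = inst & 0xFFu;     /* 8-bit immediate */
        switch (decoded[i]) {
        case DOP_MOV_IMM8: regs[rd] = imm;  break;
        case DOP_ADD_IMM8: regs[rd] += imm; break;
        case DOP_SUB_IMM8: regs[rd] -= imm; break;
        default: break;
        }
    }
}
```

The trade-off is one extra table read per executed instruction in exchange for skipping the repeated mask tests, which only pays off when the same ROM words are executed many times.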

+wavemotion Posted November 26, 2022

Oh! That's neat... converting to the decodeROM logic now. Will let you know what it produces. The extra fetch of the decode will hopefully be more than offset by the jump table lookup!

+wavemotion Posted November 26, 2022

Wow... that was the keys to the kingdom, @Thomas Jentzsch! About 5% speedup on DPC+ games and more than 10% speedup across the board for CDF/CDFJ!

For DPC+, everything runs well above 60 except Scramble which dips down into the upper 50s but isn't noticeable - perfectly playable.

For CDF/CDFJ, more games run at full speed. WoW is now full speed. Super Cobra Arcade runs in the upper 50s with gusts to 60. Even Lady Bug Arcade is now at 52 fps (up from 45).

+SpiceWare Posted November 26, 2022

Got the latest source for Stella. Had a minor issue in that git would not work, was complaining that xcrun was missing. I did upgrade my Mac Pro to macOS Monterey since I last used Xcode, so suspect it's related to that. Found this and issuing just this command fixed it:

    sudo xcode-select --reset

I put a breakpoint in the CDF1_GetWavePtr block in the Thumbulator. When I launched Draconian:

menu came up fine
was able to start game
game screen showed up
opening tune played to completion
break occurred after opening tune finished. This is when the phrase "Blast Off" is triggered.

So I looked at Draconian's source again and sure enough, GetWavePtr is not called twice every frame but only when a sample is actively playing (gCurrentSample != 0).

    static void EndSample()
    {
        gCurrentSample = 0; // 0 = no sample
        setNote(0, 0);
        resetWave(0);
        setSamplePtr((int)&AUDV0);
    }

    static void CheckSampleFinished()
    {
        if (gCurrentSample)
        {
            if ((getWavePtr(0) >> 21) > SAMPLE_SIZES[gCurrentSample-1])
            {
                if (gCurrentSample == SAMPLE_ALERT_ALERT && gAlertRepeat)
                {
                    TriggerSample(SAMPLE_ALERT_ALERT);
                    gAlertRepeat = 0;
                }
                else
                {
                    EndSample();
                }
            }
        }
    }

Blast Off lasts just over 1 second, so the initial phrase should only hit debug[2] less than 200 times before stopping until the next phrase is triggered. Have no idea why your debug[2] was hit 40,000 times.

+wavemotion Posted November 26, 2022

I'm seeing debug[2] increase at a rate of about 250 per second. I think the 40k was a mistake - related to my using debug[2]++ somewhere else in my code for unrelated debug.

So I tried just hard-coding the return from GetWavePtr to 0x7FFFFFFF seeing if that would trigger Draconian to just "move along" but no change in behavior. I can still see it hammering GetWavePtr at a rate of 250 per second. Nothing is crashing but nothing is progressing either.

Any other memory I need to ensure is being handled right? Do you read any of the audio stuff via a different mechanism?

+SpiceWare Posted November 26, 2022

Hmm, 250 still seems to be 2x what it should be for 1 second. I'll review the source to see if I can spot anything else.

+wavemotion Posted November 26, 2022

@Thomas Jentzsch with the big speedup using the decodedROM[] I was thinking perhaps that could be enhanced. For example, the conditional branch is heavily used in most programs - Galagon calls it about 200k per second. Since the 8-bit decoded table (256 possible values) only uses about 72 (rough count) opcodes... some of the most heavily used opcodes could be further split during decoding. The conditional branch, for example, could be split into the 14 different condition types (branch if zero, branch if not zero, etc). This would just add to the opcode count but would save the shift, AND and switch for that instruction. I'd have to do some profiling, but I'm guessing there are other heavy hitting instructions that would benefit from further pre-decode logic.
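
The splitting proposed above could look roughly like this. It is a sketch under assumptions: `DOP_BCOND_BASE`, `DOP_SWI`, `DOP_NOT_BRANCH`, `predecode_branch` and `branch_target` are hypothetical names, not Stella's, and it relies only on the standard Thumb encoding of the conditional branch (`1101 cccc iiiiiiii`), where condition values 0 to 13 are real conditions, 14 is undefined and 15 encodes SWI.

```c
#include <stdint.h>

/* Hypothetical decoded-opcode numbering: one value per branch condition. */
enum {
    DOP_NOT_BRANCH = 0,
    DOP_BCOND_BASE = 1,   /* DOP_BCOND_BASE + cond, for cond = 0 (EQ) .. 13 (LE) */
    DOP_SWI        = 15
};

/* Decode-time split: the shift and AND that extract the condition field
   are paid once here instead of on every execution of the branch. */
static uint8_t predecode_branch(uint16_t inst)
{
    if ((inst & 0xF000) != 0xD000)
        return DOP_NOT_BRANCH;
    unsigned cond = (inst >> 8) & 0xFu;
    if (cond == 0xFu)
        return DOP_SWI;
    if (cond == 0xEu)
        return DOP_NOT_BRANCH;           /* undefined encoding */
    return (uint8_t)(DOP_BCOND_BASE + cond);
}

/* Branch destination: instruction address + 4 + sign-extended imm8 * 2.
   This could also be computed once at decode time and cached. */
static uint32_t branch_target(uint32_t inst_addr, uint16_t inst)
{
    int32_t offset = (int32_t)(int8_t)(inst & 0xFF) * 2;
    return inst_addr + 4u + (uint32_t)offset;
}
```

With this numbering, the execution loop would have a separate `case` per condition (for instance `DOP_BCOND_BASE + 0` testing only the zero flag), so no second switch on the condition field is needed at run time.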

+SpiceWare Posted November 26, 2022

The game does not move the ship until Blast Off is finished. The start of ProcessJoystick is:

    static void ProcessJoystick()
    {
        int i;
        int x;
        int y;
        int direction;
        int free_shot[2];
        int offset;

        x = 0;
        y = 0;

        if (zil.gRoundCleared)
            return;

        if (gStartSequence)
        {
            // 2 = show "SECTOR #" while waiting for tune to finish or gGameDelay to count down, then trigger "Blast Off",
            // 1 = show "SCORE RADAR LIVES" wait for "Blast off" to finish
            // 0 = round fully active, ship starts to move
            if (gStartSequence == 2)
            {
                // wait for tune to finish, or game delay to run out
                if (gGameDelay)
                    gGameDelay--;
                if (SOUND_VOL[gActiveSound[0]] == 0 && SOUND_VOL[gActiveSound[1]] == 0 && gGameDelay == 0)
                {
                    gStartSequence--;
                    TriggerSample(SAMPLE_BLAST_OFF);
                }
            }
            else if (gStartSequence == 1)
            {
                if (gCurrentSample == 0)
                    gStartSequence--;
            }
            gIdleConditionRedTimer = gIdleConditionRedMax; // reset idle timer
            return; // <<<===--- if tune or "blast off" are still playing then return from routine before processing joystick input
        }

        direction = joystick_direction[SWCHA >> 4];
        ... process joystick input
    }

50 minutes ago, llabnip said: hard-coding the return from GetWavePtr to 0x7FFFFFFF

0x7FFFFFFF won't work for BlastOff. The check is:

    if ((getWavePtr(0) >> 21) > SAMPLE_SIZES[gCurrentSample-1])

0x7FFFFFFF >> 21 is 0x3FF and the size of Blast Off's sample is 0x04db.

    sample_blastoff 5aa3 (R )
    sample_blastoff_size 04db (R )

getWavePtr is unsigned int, try returning 0xFFFFFFFF

    unsigned int getWavePtr(int wave)
    {
        unsigned int ptr;
        asm volatile(
            "mov r2, %1\n\t"
            "ldr r4, =0x759\n\t"
            "mov lr, pc\n\t"
            "bx r4\n\r"
            "mov %0, r2\n\r"
            : "=r" (ptr)
            : "r" (wave)
            : "r2", "r4", "lr", "cc");
        return ptr;
    }
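
The arithmetic behind the 0x7FFFFFFF versus 0xFFFFFFFF suggestion can be checked with a few lines of C. The helper name `sample_finished` is hypothetical; it simply mirrors the comparison in CheckSampleFinished, using the Blast Off sample size of 0x04db quoted above.

```c
#include <stdint.h>

/* Mirrors the test in CheckSampleFinished: the top 11 bits of the 32-bit
   wave pointer are compared against the sample size. */
static int sample_finished(uint32_t wave_ptr, uint32_t sample_size)
{
    return (wave_ptr >> 21) > sample_size;
}

/* Values from the thread:
     0x7FFFFFFF >> 21 = 0x3FF (1023), which is not greater than 0x04DB (1243)
     0xFFFFFFFF >> 21 = 0x7FF (2047), which is greater than 0x04DB (1243)   */
```

So a hard-coded return of 0x7FFFFFFF leaves the sample "still playing" forever for any sample longer than 0x3FF, while 0xFFFFFFFF exceeds every size that fits in the 11-bit field except 0x7FF itself.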