
Any info on StellaDS?


AgentOrange96


21 minutes ago, Thomas Jentzsch said:

My favorite pastime. :D 

Mine too... though it can be frustrating until you find that next breakthrough!

 

Right now, CDFJ games play fine (albeit slowly).

 

I have three older CDF games that are all marked as V1 (the version byte is 01) and of those only Super Cobra works. Draconian and Mappy both crash after the title screen.  I'm handling the different offsets for the V1 fetchers the same way current Stella does (otherwise all 3 games crash even before the splash screen).  Not sure what I've got wrong... so optimization is on hold until I can get all games working.  I don't have any V0 CDF games to try.

Edited by llabnip

2 hours ago, llabnip said:

Well... it's not going to set any land-speed-records... but it's something.

 

 

I thought it'd be a bit faster, but that is a promising start!  I suspect the reason is moving the datastream pointer initializations from 6507 to ARM means more ARM code can be run each frame under CDFJ.

 

One thing that might help is the new CDFJ+ framework will be clocked at 60 MHz instead of 70 MHz. The drop is because the ARMs with 64K Flash and up are rated for 60 MHz, so if you started with a 32K CDFJ+ project at 70 MHz then decided to increase to 64K you'd see a drop in performance. It will be possible for a 32K CDFJ+ game to clock itself at 70 MHz, though I don't know if we'll advertise that ability or not.


Was able to track down a few examples before heading over to my folks. The zip contains an early build of Draconian and test ROMs that target specific features of CDF. 

 

CDFv0.zip

 

I believe the test ROMs have been built for each version of CDF, so I should be able to post those as well if you think they'd be useful.

 

Heading over to my folks now, so have a great Thanksgiving!


Hope you had a great T-Day, @SpiceWare

 

With a full stomach and some debug, I believe the problem with my emulation and Draconian and Mappy is this code:

 

[Image: screenshot of the Thumbulator code in question, with debug[x] counters]

Those two games get into this logic in the Thumbulator... and those debug[x] are just some debug counters so I can see what's getting hit.

 

I'm not sure I've got my address test correct here and maybe I've got these skewed... does this make sense:

 

debug[0] was hit 6 times (i.e. CDF1_SetNote called 6 times)

debug[1] was hit 2 times (i.e. CDF1_ResetWave called 2 times)

debug[2] was hit 40,000 times (GetWavePtr... over the course of several minutes)

debug[3] and debug[4] were never called.

debug[5] was called some 23,000 times (not even sure what this does... apparently nothing in my case)

 

Thoughts or clues of any kind might help uncover the mystery here!  Frantic, Boom, Galaga, Lode Runner, RobotWar, RubyQ, WoW, Zevious, Zoo Keeper and Super Cobra all seem to play fine. Will try your V0 tests soon.

 

Edited by llabnip

Ok... finished with the first big round of optimization on CDF/CDFJ.  Mostly around the data streams (Fast Fetchers and Fast Jumps) which brought a fairly large improvement in speed.

 

The "playables":

Lode Runner is full speed.

rubyQ is full speed.

Frantic is playable in the upper 50 fps (often hitting 60)

Qyx is playable in the upper 50 fps (often hitting 60)

WoW is playable in the upper 50fps (hitting 60 when there are just a few enemies left)

Galaga is playable in the mid-to-lower 50 fps

 

The "nice to see but needs more speed":

Robotwar is around 52 fps

Zookeeper is around 50 fps

Super Cobra Arcade is around 50 fps

Lady Bug Arcade hits the emulator the hardest at 45 fps

 

The "unplayables":

Draconian (62fps on title screen but crashes with the aforementioned music fetchers)

Mappy (crashes with the aforementioned music fetchers)

 

 

 


Thanksgiving was good, hope yours was too.

 

My aunt and uncle were over from Bandera (just west of San Antonio) for Thanksgiving.  They're now on a road trip to Branson with my folks, their first road trip in an EV - they seem to be a little nervous, even though my folks have made this trip numerous times in Sparky, their Model Y 🤷‍♂️.

 

We decked out Sparky for the Holidays:

 

[Four photos of Sparky decorated for the holidays]

 

 

Months ago they asked me to watch their dog while they were on the trip. I've had her since last night and she's very distraught, so I didn't sleep very well last night.  She's also having an issue with her toenails falling out, resulting in her bleeding all over the house - looked like multiple miniature murder scenes throughout the house 😱. Appears to have stopped, for now. Hard floors cleaned up OK, now I need to see if I have anything to clean up the carpet, else it's off to the stores on Black Friday 😱.

 

The BUS and CDF drivers are written in 32-bit ARM mode, not Thumb mode.  There are a few 32-bit audio related subroutines in the driver that we call from Thumb mode: 

  • 0 = NoteStore /* Update note values (r2 = note, r3 = freq) */
  • 1 = ResetWaveStore /* Reset wave (r2 = wave) */
  • 2 = WavePtrFetch /* Fetch waveform pointer (r2 = waveform, returned in r3) */
  • 3 = WaveSizeStore /* Update waveform size (r2 = waveform, r3 = size) */

Thumbulator doesn't emulate 32-bit mode, so those get passed back to the cartridge class to handle. 32-bit is triggered by using an even address, while Thumb mode is triggered by an odd address.
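A minimal sketch of that even/odd test, for anyone following along. Only the four indices and the even = ARM, odd = Thumb rule come from the post above; the names `ArmCallback`, `is_thumb_target` and `must_hand_off` are made up for illustration and are not Stella's actual code.

```c
#include <stdint.h>
#include <stdbool.h>

/* Callback indices as listed in the post; the enum names are illustrative. */
typedef enum {
    CB_NOTE_STORE       = 0,  /* r2 = note, r3 = freq */
    CB_RESET_WAVE_STORE = 1,  /* r2 = wave */
    CB_WAVE_PTR_FETCH   = 2,  /* r2 = waveform */
    CB_WAVE_SIZE_STORE  = 3   /* r2 = waveform, r3 = size */
} ArmCallback;

/* Bit 0 of a BX target selects the instruction set:
   1 = Thumb (emulated), 0 = 32-bit ARM (not emulated). */
static bool is_thumb_target(uint32_t target)
{
    return (target & 1u) != 0u;
}

/* Hypothetical helper: true when the branch lands in 32-bit ARM code,
   so the call has to be handed to the cartridge class instead. */
static bool must_hand_off(uint32_t target)
{
    return !is_thumb_target(target);
}
```

With this, an odd target such as 0x759 stays inside the Thumb emulator, while an even one such as 0x83a is handed off.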

 

The 0x0000083a has to do with the custom ARM code finishing and returning control to the 32-bit driver.

 

I don't recall what the last else block was for - the thumbCallback(255...) doesn't do anything in the BUS or CDF cartridge classes.

 

40,000 times for GetWavePointer is surprising - it's called twice per frame in Draconian (first thing in OverScan, last thing in VerticalBlank) to find out if the current sample has finished:

 

static void CheckSampleFinished()
{
    if (gCurrentSample)
    {
      if ((getWavePtr(0) >> 21) > SAMPLE_SIZES[gCurrentSample-1])
      {
        if (gCurrentSample == SAMPLE_ALERT_ALERT && gAlertRepeat)
        {
          TriggerSample(SAMPLE_ALERT_ALERT);
          gAlertRepeat = 0;
        }
        else
        {
          EndSample();
        }
      }
    }
}

 

 


22 minutes ago, llabnip said:

Ok... finished with the first big round of optimization on CDF/CDFJ.  Mostly around the data streams (Fast Fetchers and Fast Jumps) which brought a fairly large improvement in speed.

Would it make sense to merge your optimizations back to Stella? Or are they too platform specific?


4 hours ago, Thomas Jentzsch said:

Would it make sense to merge your optimizations back to Stella? Or are they too platform specific?

Many are platform specific... some are hacks you wouldn't want in the mainline Stella. But some fast things that had a ton of speedup were:

  • Not calling into the execute() for each Thumb instruction - the overhead of the call was not optimized away with GCC at the max settings so I moved the handling of the Thumb loop to inside the execute().
  • Keeping a 16-bit pointer always pointing to the next instruction rather than re-index into the Thumb instruction ROM array. I don't even bother to update the PC register until it's needed (easy to back-calculate from the 16-bit Thumb instruction pointer from the start of ROM).
  • The biggest improvement in speed came from simply using the top 2 bits of the Thumb instruction to binary parse the instruction. So I check if the high bit (bit 15) is set - that puts the instruction into one of two buckets. Then I check the next bit (bit 14) to split those two buckets into two further buckets.  This way I only have to check the instructions in one bucket, which really reduces the long search for the opcode.  Then I did some profiling and found the popular instructions, which were often several orders of magnitude more likely to be called - and I check them first in each bucket (e.g. ADD big immediate one register, CMP immediate, conditional branch, etc.)

I'd say those 3 things got me almost a 40% speedup.
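A rough sketch of the bucket idea from the third bullet, not the actual StellaDS code. The `HotOp` enum and `classify_hot` are hypothetical names, and only three "hot" instructions are shown; the bit patterns are the standard Thumb encodings for ADD/CMP with an 8-bit immediate and for the conditional branch.

```c
#include <stdint.h>

/* Hypothetical labels for a few frequently executed instructions. */
typedef enum { OP_OTHER, OP_ADD_IMM8, OP_CMP_IMM8, OP_BCOND } HotOp;

/* Bits 15 and 14 pick one of four buckets; inside a bucket the
   most frequent instructions are tested first. */
static HotOp classify_hot(uint16_t insn)
{
    switch (insn >> 14) {
    case 0: /* 00xx: shifts, add/sub, move/compare/add/sub immediate */
        if ((insn & 0xF800u) == 0x3000u) return OP_ADD_IMM8; /* ADD Rd, #imm8 */
        if ((insn & 0xF800u) == 0x2800u) return OP_CMP_IMM8; /* CMP Rn, #imm8 */
        break;
    case 3: /* 11xx: multiple load/store, branches */
        /* 1101 cond imm8; cond 14 is undefined and cond 15 is SWI */
        if ((insn & 0xF000u) == 0xD000u && ((insn >> 8) & 0xFu) < 14u)
            return OP_BCOND;
        break;
    default: /* buckets 1 and 2 are not covered in this sketch */
        break;
    }
    return OP_OTHER;
}
```

The point is only that two cheap bit tests cut the candidate list to a quarter before any mask-and-compare chain starts.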

 

But probably mainline Stella doesn't need it... processing power keeps going through the roof. I'm in a somewhat special situation in that the classic Nintendo DS/DSi isn't getting any faster. Why did I target that platform?  It was the handheld collecting dust in my closet. And they made more than 150 Million of them so they are everywhere and easy to get on the second hand market. Plus: constraints breed creativity. 

 

Meanwhile... Wizard of Wor Arcade (demo) at 58 fps :)

[Image: screenshot of Wizard of Wor Arcade (demo)]


6 hours ago, llabnip said:

Ok... finished with the first big round of optimization on CDF/CDFJ.  Mostly around the data streams (Fast Fetchers and Fast Jumps) which brought a fairly large improvement in speed.

 

The "playables":

Qyx is playable in the upper 50 fps (often hitting 60)

WoW is playable in the upper 50fps (hitting 60 when there are just a few enemies left)

Galaga is playable in the mid-to-lower 50 fps

 

The "nice to see but needs more speed":

Robotwar is around 52 fps

Zookeeper is around 50 fps

Super Cobra Arcade is around 50 fps

That's really good! :thumbsup:  

6 hours ago, llabnip said:

Lady Bug Arcade hits the emulator the hardest at 45 fps

Hmm, I'm surprised this is hitting the emulator the hardest, especially since it's one of the simplest games, although I suspect it's because I update the entire playfield each frame to achieve the 'blending' effect of the orange/green to get white, and I update the doors each frame.  There is an option to disable the blending by flipping the right difficulty to 'A' (in this case it will just alternate pink/green lines), which may improve performance, although I suspect I was 'lazy' and, to save ROM, didn't put in different code to *not* update the entire screen even in this mode.

6 hours ago, llabnip said:

The "unplayables":

Draconian (62fps on title screen but crashes with the aforementioned music fetchers)

Mappy (crashes with the aforementioned music fetchers)

 

That'll be cool if you can get these two to play! 🤞

 

Thanks for all the hard work, great job! 👍


8 hours ago, llabnip said:

Many are platform specific... some are hacks you wouldn't want in the mainline Stella. But some fast things that had a ton of speedup were:

  • Not calling into the execute() for each Thumb instruction - the overhead of the call was not optimized away with GCC at the max settings so I moved the handling of the Thumb loop to inside the execute().
  • Keeping a 16-bit pointer always pointing to the next instruction rather than re-index into the Thumb instruction ROM array. I don't even bother to update the PC register until it's needed (easy to back-calculate from the 16-bit Thumb instruction pointer from the start of ROM).
  • The biggest improvement in speed came from simply using the top 2 bits of the Thumb instruction to binary parse the instruction. So I check if the high bit (bit 15) is set - that puts the instruction into one of two buckets. Then I check the next bit (bit 14) to split those two buckets into two further buckets.  This way I only have to check the instructions in one bucket, which really reduces the long search for the opcode.  Then I did some profiling and found the popular instructions, which were often several orders of magnitude more likely to be called - and I check them first in each bucket (e.g. ADD big immediate one register, CMP immediate, conditional branch, etc.)

I'd say those 3 things got me almost a 40% speedup.

Thanks. These are very good tips. I will definitely make use of them.

8 hours ago, llabnip said:

But probably mainline Stella doesn't need it... processing power keeps going through the roof.

True, but we are also supporting other platforms which are less powerful too (e.g. RetroN 77). And these are often on the edge of 60Hz too.

8 hours ago, llabnip said:

I'm in a somewhat special situation in that the classic Nintendo DS/DSi isn't getting any faster. Why did I target that platform?  It was the handheld collecting dust in my closet. And they made more than 150 Million of them so they are everywhere and easy to get on the second hand market. Plus: constraints breed creativity. 

That's one of the reasons why I am coding for the 2600. 😀:)


9 hours ago, llabnip said:

The biggest improvement in speed came from simply using the top 2 bits of the Thumb instruction to binary parse the instruction. So I check if the high bit (bit 15) is set - that puts the instruction into one of two buckets. Then I check the next bit (bit 14) to split those two buckets into two further buckets.  This way I only have to check the instructions in one bucket, which really reduces the long search for the opcode.  Then I did some profiling and found the popular instructions, which were often several orders of magnitude more likely to be called - and I check them first in each bucket (e.g. ADD big immediate one register, CMP immediate, conditional branch, etc.)

Thinking further about this one, I am really surprised that this makes any difference. AFAIK, compilers create branch tables for such switches. Then the order makes no difference and the access time for any opcode is constant. And since we are using enums for the opcodes, this should be straightforward for the compiler.

 

I just checked the assembler created by Visual Studio with debug settings. And even there it uses a branch table:

  switch (decodedOp) {
00000001404D915A  movzx       eax,byte ptr [decodedOp]  
00000001404D915F  mov         dword ptr [rsp+70h],eax  
00000001404D9163  mov         eax,dword ptr [rsp+70h]  
00000001404D9167  dec         eax  
00000001404D9169  mov         dword ptr [rsp+70h],eax  
00000001404D916D  cmp         dword ptr [rsp+70h],48h 	; I think here the debug code checks against the number of defined enums
00000001404D9172  ja          $LN333+5Bh (01404DDD95h)  
00000001404D9178  movsxd      rax,dword ptr [rsp+70h]  
00000001404D917D  lea         rcx,[__ImageBase (0140000000h)]  
00000001404D9184  mov         eax,dword ptr [rcx+rax*4+4DDE5Ch]  
00000001404D918B  add         rax,rcx  
00000001404D918E  jmp         rax  			; this directly jumps to the opcode's code

Could this be a wrong compiler setting on your side?

Edited by Thomas Jentzsch

@llabnip I only just noticed that you seem to have based your code on the original Thumbulator class. Which might explain some of your gains.

 

In Stella, we achieved a major speed improvement by decoding the ROM only once (into decodeROM). 
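To make the "decode once" idea concrete, here is a small sketch. It is not Stella's implementation: `DecodedOp`, `decode_one` and `predecode` are hypothetical names, and only three instruction patterns are recognised. The idea is that the mask-and-compare work happens once per ROM halfword at load time, and the run loop then switches on a small dense opcode value that a compiler can turn into a branch table.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical dense opcode set; a real table would have many more. */
typedef enum { DOP_OTHER, DOP_ADD_IMM8, DOP_CMP_IMM8, DOP_BCOND } DecodedOp;

/* The slow pattern matching, done once per halfword. */
static DecodedOp decode_one(uint16_t insn)
{
    if ((insn & 0xF800u) == 0x3000u) return DOP_ADD_IMM8;  /* ADD Rd, #imm8 */
    if ((insn & 0xF800u) == 0x2800u) return DOP_CMP_IMM8;  /* CMP Rn, #imm8 */
    if ((insn & 0xF000u) == 0xD000u && ((insn >> 8) & 0xFu) < 14u)
        return DOP_BCOND;                                  /* B<cond> */
    return DOP_OTHER;
}

/* Fill a parallel table at load time: decoded[i] describes rom[i]. */
static void predecode(const uint16_t *rom, size_t count, uint8_t *decoded)
{
    for (size_t i = 0; i < count; ++i)
        decoded[i] = (uint8_t)decode_one(rom[i]);
}
```

At run time the emulator would index `decoded[]` with the same halfword index it uses for the ROM and `switch` on the stored value, so no instruction is pattern-matched more than once.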

Edited by Thomas Jentzsch

Wow... that was the keys to the kingdom, @Thomas Jentzsch!

 

About 5% speedup on DPC+ games and more than 10% speedup across the board for CDF/CDFJ!

 

For DPC+, everything runs well above 60 except Scramble which dips down into the upper 50s but isn't noticeable - perfectly playable.

 

For CDF/CDFJ, more games run at full speed. WoW is now full speed. Super Cobra Arcade runs in the upper 50s with gusts to 60.  Even Lady Bug Arcade is now at 52 fps (up from 45). 


Got the latest source for Stella. Had a minor issue in that git would not work, was complaining that xcrun was missing.

 

I did upgrade my Mac Pro to macOS Monterey since I last used Xcode, so suspect it's related to that.  Found this and issuing just this command fixed it:

 

sudo xcode-select --reset

 

 

I put a breakpoint in the CDF1_GetWavePtr block in the Thumbulator. When I launched Draconian:

  1. menu came up fine
  2. was able to start game
  3. game screen showed up
  4. opening tune played to completion
  5. break occurred after opening tune finished.  This is when the phrase "Blast Off" is triggered.

 

[Image: screenshot of the debugger stopped at the CDF1_GetWavePtr breakpoint]

 

So I looked at Draconian's source again and sure enough, GetWavePtr is not called twice every frame but only when a sample is actively playing (gCurrentSample != 0).

 

static void EndSample()
{
  gCurrentSample = 0;  // 0 = no sample
  setNote(0, 0);
  resetWave(0);
  setSamplePtr((int) & AUDV0);
}

static void CheckSampleFinished()
{ 
    if (gCurrentSample)
    {
      if ((getWavePtr(0) >> 21) > SAMPLE_SIZES[gCurrentSample-1])
      {       
        if (gCurrentSample == SAMPLE_ALERT_ALERT && gAlertRepeat)
        {
          TriggerSample(SAMPLE_ALERT_ALERT);
          gAlertRepeat = 0;
        }
        else
        {
          EndSample();
        }
      }
    }
}

 

 

Blast Off lasts just over 1 second, so the initial phrase should hit debug[2] fewer than 200 times before stopping until the next phrase is triggered. Have no idea why your debug[2] was hit 40,000 times.


I'm seeing debug[2] increase at a rate of about 250 per second.  I think the 40k was a mistake - related to my using debug[2]++ somewhere else in my code for unrelated debug.

 

So I tried just hard-coding the return from GetWavePtr to 0x7FFFFFFF seeing if that would trigger Draconian to just "move along" but no change in behavior. I can still see it hammering GetWavePtr at a rate of 250 per second. Nothing is crashing but nothing is progressing either.

 

Any other memory I need to ensure is being handled right?  Do you read any of the audio stuff via a different mechanism?


@Thomas Jentzsch with the big speedup using the decodedROM[] I was thinking perhaps that could be enhanced. For example, the conditional branch is heavily used in most programs - Galagon calls it about 200k times per second. Since each 8-bit entry in the decoded table has room for 256 values but only about 72 (rough count) opcodes are in use, some of the most heavily used opcodes could be split further during decoding. The conditional branch, for example, could be split into its 14 different types (branch if zero, branch if not zero, etc). This would just add to the opcode count but would save the shift, AND and switch for that instruction. 

 

I'd have to do some profiling, but I'm guessing there are other heavy hitting instructions that would benefit from further pre-decode logic.
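A sketch of what splitting the conditional branch at decode time might look like. `OP_UNKNOWN`, `OP_BCOND_BASE` and `predecode_bcond` are invented for illustration; the only hard facts used are the Thumb encoding (`1101 cond imm8`) and that condition values 14 and 15 are not ordinary branch conditions (15 encodes SWI).

```c
#include <stdint.h>

#define OP_UNKNOWN    0x00u
/* Hypothetical: a run of consecutive opcode values, one per condition
   code (BEQ = base + 0, BNE = base + 1, and so on). */
#define OP_BCOND_BASE 0x40u

/* Map a conditional branch to a per-condition opcode at decode time,
   so the run loop needs no shift/AND/switch on the condition field. */
static uint8_t predecode_bcond(uint16_t insn)
{
    if ((insn & 0xF000u) != 0xD000u)
        return OP_UNKNOWN;              /* not a 1101 xxxx instruction */

    unsigned cond = (insn >> 8) & 0xFu; /* bits 11..8 */
    if (cond >= 14u)
        return OP_UNKNOWN;              /* 14 = undefined, 15 = SWI */

    return (uint8_t)(OP_BCOND_BASE + cond);
}
```

Each resulting opcode can then get its own `case` that tests exactly one flag combination and applies the offset, which is where the saved work per executed branch would come from.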


The game does not move the ship until Blast Off is finished. The start of ProcessJoystick is:

 

static void ProcessJoystick()
{
  int i;
  int x;
  int y;
  int direction;
  int free_shot[2];
  int offset;
  
  x = 0;
  y = 0;
  
  if (zil.gRoundCleared)
    return;
  
  if (gStartSequence)
  {
    // 2 = show "SECTOR #" while waiting for tune to finish or gGameDelay to count down, then trigger "Blast Off",
    // 1 = show "SCORE RADAR LIVES" while waiting for "Blast Off" to finish
    // 0 = round fully active, ship starts to move

    if (gStartSequence == 2)
    {
      // wait for tune to finish, or game delay to run out
      if (gGameDelay)
        gGameDelay--;
      if (SOUND_VOL[gActiveSound[0]] == 0 && SOUND_VOL[gActiveSound[1]] == 0 && gGameDelay == 0)
      {
        gStartSequence--;
        TriggerSample(SAMPLE_BLAST_OFF);
      }
    }
    else if (gStartSequence == 1)
    {
      if (gCurrentSample == 0)
        gStartSequence--;
    }
    gIdleConditionRedTimer = gIdleConditionRedMax;  // reset idle timer
    return;   // <<<===--- if tune or "Blast Off" is still playing then return from routine before processing joystick input
  }
  
  direction = joystick_direction[SWCHA >> 4];
  
  // ... process joystick input
}

 

 

 

50 minutes ago, llabnip said:

hard-coding the return from GetWavePtr to 0x7FFFFFFF

 

0x7FFFFFFF won't work for BlastOff. The check is:

if ((getWavePtr(0) >> 21) > SAMPLE_SIZES[gCurrentSample-1])

 

 

0x7FFFFFFF >> 21 is 0x3FF and size of Blast Off's sample is 0x04db.

 

sample_blastoff          5aa3              (R )
sample_blastoff_size     04db              (R )

 

getWavePtr returns an unsigned int, so try returning 0xFFFFFFFF

 

unsigned int getWavePtr(int wave) {
  unsigned int ptr;
  asm volatile(
    "mov r2, %1\n\t"
    "ldr r4, =0x759\n\t"
    "mov lr, pc\n\t"
    "bx r4\n\t"
    "mov %0, r2\n\t"
    : "=r" (ptr) : "r" (wave)
    : "r2", "r4", "lr", "cc");
  return ptr;
}
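The shift arithmetic above can be checked with a few lines of C. This is only a sketch of the comparison from CheckSampleFinished; `sample_finished` and `BLAST_OFF_SIZE` are names made up here, with the size value taken from the symbol listing above.

```c
#include <stdint.h>
#include <stdbool.h>

/* sample_blastoff_size from the symbol listing in the post. */
#define BLAST_OFF_SIZE 0x04dbu

/* Same test as in CheckSampleFinished: the top 11 bits of the wave
   pointer are compared against the sample size. */
static bool sample_finished(uint32_t wave_ptr, uint32_t sample_size)
{
    return (wave_ptr >> 21) > sample_size;
}
```

0x7FFFFFFF >> 21 gives 0x3FF, which is below 0x4db, so the sample never counts as finished; 0xFFFFFFFF >> 21 gives 0x7FF, which is above it.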

 

 

 
