
Any info on StellaDS?


AgentOrange96

6 minutes ago, SpiceWare said:

Is your CDFCallBack() returning a signed int?

 

 Nope, it's uInt32.

 

Case 2 looks wrong. StellaDS:

 

uInt32 CDFCallback(uInt8 function, uInt32 value1, uInt32 value2)
{
  switch (function)
  {
    case 0:
      // _SetNote - set the note/frequency
      myMusicFrequencies[value1] = value2;
      break;

      // _ResetWave - reset counter,
      // used to make sure digital samples start from the beginning
    case 1:
      myMusicCounters[value1] = 0;
      break;

      // _GetWavePtr - return the counter
    case 2:
        myMusicCounters[value1] = myMusicWaveformSize[value1];  //TBD   <<<===--- ??????????????
        return myMusicCounters[value1];

      // _SetWaveSize - set size of waveform buffer
    case 3:
      myMusicWaveformSize[value1] = value2;
      debug[18] = value2;
      break;
  }

  return 0;
}

 

 

Vs Stella:

uInt32 CartridgeCDF::thumbCallback(uInt8 function, uInt32 value1, uInt32 value2)
{
  switch (function)
  {
    case 0:
      // _SetNote - set the note/frequency
      myMusicFrequencies[value1] = value2;
      break;

      // _ResetWave - reset counter,
      // used to make sure digital samples start from the beginning
    case 1:
      myMusicCounters[value1] = 0;
      break;

      // _GetWavePtr - return the counter
    case 2:
      return myMusicCounters[value1];

      // _SetWaveSize - set size of waveform buffer
    case 3:
      myMusicWaveformSize[value1] = value2;
      break;

    default:
      break;
  }

  return 0;
}

 


52 minutes ago, llabnip said:

@Thomas Jentzsch with the big speedup using the decodedROM[] I was thinking perhaps that could be enhanced. For example, the conditional branch is heavily used in most programs - Galagon calls it about 200k times per second. Since the 8-bit decoded table (256 possible values) only uses about 72 opcodes (rough count), some of the most heavily used opcodes could be further split during decoding. The conditional branch, for example, could be split into the 13 different types (branch if zero, branch if not zero, etc). This would just add to the opcode count but would save the shift, AND and switch for that instruction.

Sounds like a good idea to me, too.


1 hour ago, Thomas Jentzsch said:

Sounds like a good idea to me, too.

Just did some quick-and-dirty profiling on some of the CDFJ games... 

 

b1, add3, cmp1, cmp2, ldr4 and ldrb1 were all massively hit.

 

I decode split the heaviest hitters:

 

      add2_0, add2_1, add2_2, add2_3, add2_4, add2_5, add2_6, add2_7,
      b1_000, b1_100, b1_200, b1_300, b1_400, b1_500, b1_600, b1_700,
      b1_800, b1_900, b1_a00, b1_b00, b1_c00, b1_d00, b1_e00, b1_f00,
      cmp1_0, cmp1_1, cmp1_2, cmp1_3, cmp1_4, cmp1_5, cmp1_6, cmp1_7,
      ldr4_0, ldr4_1, ldr4_2, ldr4_3, ldr4_4, ldr4_5, ldr4_6, ldr4_7,

 

That alone gave me a 3% speed boost... another frame or two on many of the games. Lady Bug Arcade is now up to 54 fps... everything else is doing better than that and most of the ARM stuff is 60+.
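To make the decode split concrete, here is a minimal sketch of pre-decoding the Thumb conditional branch into one op per condition value. The enum name `Op` and the helpers `predecode`, `predecode_rom` and `branch_taken` are hypothetical and are not taken from the StellaDS source; only the `b1_xxx` names come from the list above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical op IDs: one per value of the condition field (bits 11..8),
// named after the b1_xxx entries in the post. Everything else is "other".
enum class Op : uint8_t {
  b1_000, b1_100, b1_200, b1_300, b1_400, b1_500, b1_600, b1_700,
  b1_800, b1_900, b1_a00, b1_b00, b1_c00, b1_d00, b1_e00, b1_f00,
  other
};

// Decode once, ahead of time: the shift and mask happen here, not per execution.
Op predecode(uint16_t inst) {
  if ((inst & 0xF000) == 0xD000) {
    // The first 16 enum values are laid out to match the condition field.
    return static_cast<Op>((inst >> 8) & 0xF);
  }
  return Op::other;
}

// Build a decoded table that runs parallel to the ROM.
std::vector<Op> predecode_rom(const std::vector<uint16_t>& rom) {
  std::vector<Op> decoded;
  decoded.reserve(rom.size());
  for (uint16_t inst : rom) decoded.push_back(predecode(inst));
  return decoded;
}

// At run time each split op tests its flag directly, with no inner switch.
// Only EQ (condition 0) and NE (condition 1) are shown here.
bool branch_taken(Op op, bool z_flag) {
  assert(op == Op::b1_000 || op == Op::b1_100);
  return (op == Op::b1_000) ? z_flag : !z_flag;
}
```

The cost is a larger outer switch in exchange for dropping the per-instruction shift, AND and inner switch.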

 

Draconian and Mappy are still unplayable.  @SpiceWare I tried returning 0xFFFFFFFF with no joy.

 

Mappy isn't even reading CDF1_GetWavePtr but it is calling lots of CDF1_SetNote.  I keep thinking maybe it's just that I'm not handling CDF V1 properly (i.e. the whole audio stuff is a red herring) but the V1 Super Cobra Arcade plays perfectly.

 

Frustrating!  But I appreciate all the help this crew has shown...

 


19 hours ago, johnnywc said:

Hmm, I'm surprised this is hitting the emulator the hardest especially since it's one of the simplest games, although I suspect it's because I update the entire playfield each frame to achieve the 'blending' effect of the orange/green to get white and I update the doors each frame.  There is an option to disable the blending by flipping the right difficulty to 'A' (in this case it will just alternate pink/green lines) which may improve performance, although I suspect I was 'lazy' and didn't put in different code to *not* update the entire screen even in this mode to save ROM.👍

LBA uses ~31000 ARM cycles in VBlank and just ~3000 in Overscan, Draconian needs ~32000/7500. So there must be something special to LBA.

 

My CPU goes up to 340 FPS for LBA and 360 FPS for Draconian. So the problem is not limited to StellaDS. And I suppose Draconian might be slow too.

 

Edit: The other ARM games I checked use significantly fewer ARM CPU cycles. E.g. SCA 20000/2000 and Galagon 23000/3000.

Edited by Thomas Jentzsch

I just checked in 5.9f with all of the speed improvements to ARM-assisted games (or, in the case of CDFJ, I'm going to start calling them 6502-Assisted games :) )

 

I just put in all of the CDFJ games I have into the internal database so that games are centered and scaled on screen as best I can (the DS only has 192 vertical pixel rows... so especially @johnnywc's games need a little scaling and tweaking to make them look as good as possible with the screen limitations).

 

If anyone happens to have the md5 sums of the official non-demo ROMs, I'd love those hashes to add to my database.

 

I'm going to keep chipping away at my bug related to the ARM-assisted-audio. It can't hide forever!  I'm hopeful to release V6.0 of StellaDS with full support for the entire DPC+, CDF and CDFJ library (but not CDFJ-Plus yet... would rather spend my time getting Draconian and Mappy to play first). 


@SpiceWare - got it!

 

Not fully fixed yet... but now I know what the problem was. I don't update the PC on every instruction - instead I use a pointer to the Thumb code and only when an instruction needs the PC (e.g. a branch or similar) do I patch it up.  I missed a patch somewhere. 

 

This code now updates the PC on every instruction - which is too slow but it allows the game to work perfectly (Mappy seems to work as well).  I will find the culprit with coffee in the morning - but nothing can prevent it from working now!  I'm off to see family for the evening - but this gets fully solved in the AM.

 


 

 

Edit! Found the culprit.  MOV(3) can affect the PC register and I wasn't patching up my PC counter.   Made the fix - Draconian plays beautifully at almost 60 fps (just dips down a few frames... nothing serious).  No voice yet - that's tomorrow.  Maybe.  I'd be happy just that it plays buttery-smooth now!

Edited by llabnip

Yeah... the fast music fetchers are a problem still in terms of horrible sound quality.  I had to disable them for now - which sucks a bit on Draconian as it's just amazing sound when playing on my Harmony cart... but at least the gameplay is intact and is still super fun.  Mappy suffers more as it has a higher reliance on the digital audio stuff. 

 

My sound driver just can't cope - I can almost hear it say "Alert Alert" and "Blast Off" but it sounds like the sound is made from three rocks hitting each other in quick succession.   

 

I'll have to work on it more - but I'm 30+ hours of debug into just getting the games to play and I might just release 6.0 with a few more tweaks (mostly to get the screen scaled right... I've done so for Draconian but not yet for Mappy hence your cut-off score... you can tweak the screen settings yourself in the configuration but I like to have the stuff work as close to right so users get a nice visual "out of the box").

 

 


I found my son's DSi finally! 👀  :)  I really want to get this emulator installed and try out these games (and see why LB:A is so slow :? ).  Sad to say, I've never even used a DSi :dunce: .  Is there a quick guide on what needs to be done to get StellaDS running?  I see an SD card slot on the back so I assume that's where the emulator and ROMs will be stored?  


Cool! So if it's really a DSi, the SD card slot should be on the right side covered up by a little flap.  The slot on the back is the cartridge port.  

 

If it's an older DS, it will not have the SD slot on the right side.  The only way to make that play games is via a flash cart - they are cheap (an R4i clone cart is like $25) but they will only play the more basic games (4K/8K and some 16K/32K games that aren't too taxing). The DSi has a 2x CPU that is unlocked for homebrews via the SD card.

 

Most of us follow the guide here:  https://dsi.cfw.guide/get-started.html#requirements 

 

I use the "Memory Pit" exploit that basically replaces one file on the SD card that is activated when your DSi "takes a picture" - this is the foot in the door that is needed to launch the custom menu.

 

I also use Twilight Menu++ as my menu launcher of choice.

 

With that, it's just a matter of downloading the StellaDS.nds (or any of the other emulators I worked on) and placing your ROMs wherever you want (recommend /roms/2600 as it will start looking there). 
 

If you struggle, I’ll offer to build you an SD card and ship it free of charge. I owe you at least that for your great ports over the years!

 

Edited by llabnip

So just for laughs and to bring this to a logical conclusion... I did try to jam in CDFJ+ support and tried the only game demo ROM I have: Gorf (Demo V3)

 


 

It worked... kinda. The emulator thinks it's running 60 fps (throttled 62) but the game is running about half the speed it does on a real Atari.

 

Possibly related: there is no sound. 

 

Does Gorf use the fast music fetchers?  I see that it uses the offset for LDA fast fetchers (and does not appear to use LDX or LDY fast fetchers - if this is not correct, please let me know and I'll dig in more!).

 

 


As an aside... I did just check in a daily build 6.0a with the fast music fetchers enabled for CDF/CDFJ (didn't want it in the official 6.0 release until I got more time on it).

 

It doesn't sound good - but after having it on/off, I can say that Draconian is a better experience with it enabled. More sound effects (like explosions) happen properly and the voice, while it sounds like two-cups-and-wax-string, is still recognizable especially for the 'Alert Alert' which warns me that I'm about to get crushed :)

 

I'll keep tinkering. I'm starting to understand things more - I think the fast music fetchers are used in combo with the sort of direct audio output mode of the TIA?  If so... that's always been a problem. Quadrun and other games that try to produce speech using it don't play nice with the way I'm sampling audio on the DS. Stella is as close to cycle-accurate as an emulator can get... StellaDS is not - and the limitations of the platform are starting to show (well... that and my moderate skill level!!).

 

@Thomas Jentzsch

After profiling a number of ARM-assisted games, this is the set of "fake" instructions I added to the decode logic:

 

      add2_0, add2_1, add2_2, add2_3, add2_4, add2_5, add2_6, add2_7,
      b1_000_neg, b1_000_pos, b1_100_neg, b1_100_pos, b1_200, b1_300, b1_400, b1_500, b1_600, b1_700,
      b1_800, b1_900, b1_a00, b1_b00, b1_c00, b1_d00, b1_e00, b1_f00,
      b2_pos, b2_neg,
      cmp1_0, cmp1_1, cmp1_2, cmp1_3, cmp1_4, cmp1_5, cmp1_6, cmp1_7,
      ldr4_0, ldr4_1, ldr4_2, ldr4_3, ldr4_4, ldr4_5, ldr4_6, ldr4_7,

 

These were heavy-hitters. Where you see _neg or _pos, that decodes the branch/jump logic as a positive jump or negative jump which allows for very fast processing:

 

rb=(inst & 0xFF); // Positive Jump

rb=(inst | 0xFFFFFF00); // Negative Jump (sign extend)

 

Where you see _0, _1, etc. that just decodes the register logic so avoids the shift and mask. 

 

When those are being done a million times per second, it adds up!
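As a rough illustration of why the `_pos`/`_neg` split helps, the sketch below compares a generic run-time sign extension with the two pre-split forms from the post. The function names are made up for this example; the target calculation relies on the standard Thumb rule that the PC reads as the instruction address plus 4 and the 8-bit offset is counted in halfwords.

```cpp
#include <cassert>
#include <cstdint>

// Generic path: the sign of the 8-bit offset is tested on every execution.
inline uint32_t offset_generic(uint16_t inst) {
  uint32_t rb = inst & 0xFF;
  if (rb & 0x80) rb |= 0xFFFFFF00u;  // sign extend
  return rb;
}

// Pre-split paths: the decoder already knows the sign, so no test is needed.
inline uint32_t offset_pos(uint16_t inst) { return inst & 0xFF; }

inline uint32_t offset_neg(uint16_t inst) {
  assert(inst & 0x80);  // only valid when the decoder saw a negative offset
  // OR-ing with 0xFFFFFF00 also overwrites the opcode bits 15..8,
  // so no separate mask is required.
  return inst | 0xFFFFFF00u;
}

// Branch target for an instruction located at inst_addr (a byte address).
inline uint32_t branch_target(uint32_t inst_addr, uint32_t rb) {
  return inst_addr + 4 + (rb << 1);
}
```

For example, `0xD0FE` (branch-if-equal with offset -2) at address `0x100` resolves back to `0x100`, i.e. a branch to itself.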

 

The other big improvement was elimination of the PC logic to just use a fast fetching pointer for the instruction and decode and avoiding the overhead of calling the function every time:

 

  while (1)
  {
      uInt16 inst = *thumb_ptr++;
      Thumbulator::Op decoded = (Thumbulator::Op)*thumb_decode_ptr++;
      
      switch (decoded)
      {

        ....

 

In the few cases where I need the PC (register 15), I just patch it up:

 

#define FIX_R15_PC reg_sys[15] = ((u32) (thumb_ptr - rom) << 1) + 3;
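Here is a small sketch of how such a pointer-based fetch and an on-demand PC patch might fit together. The `Core` struct and the `jump_to` helper are hypothetical and assume the ROM is mapped at address 0; the arithmetic in `fix_r15` mirrors the macro above. My reading of that arithmetic is that the pointer has already advanced past the fetched instruction, so the result is the instruction address plus 4 with the Thumb bit set.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Core {
  const uint16_t* rom;        // start of the Thumb code (assumed mapped at 0)
  const uint16_t* thumb_ptr;  // always points at the NEXT instruction
  uint32_t reg[16];

  // Hot path: no PC register update, just a pointer bump.
  uint16_t fetch() { return *thumb_ptr++; }

  // Rebuild r15 from the pointer only when an instruction reads the PC.
  void fix_r15() {
    assert(thumb_ptr >= rom);
    reg[15] = (static_cast<uint32_t>(thumb_ptr - rom) << 1) + 3;
  }

  // The other direction: after an instruction writes a new address
  // (a branch, or a MOV into r15), the pointer has to be re-derived.
  // byte_addr is the byte address of the next instruction to execute.
  void jump_to(uint32_t byte_addr) {
    thumb_ptr = rom + ((byte_addr & ~1u) >> 1);
  }
};
```

Any instruction that can write r15 needs the second step, which is the kind of spot where a missed patch would go unnoticed until a game happens to use it.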

 

Lastly, I combined the ZN flags into one variable since we set those bits orders of magnitude more than we check them...

 

#define do_znflags(x) ZNflags=(x)

 

So then I just need to look for the high-bit for N and any zero value for Z.
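A minimal sketch of that combined-flag idea, using a hypothetical `Flags` struct: the last result is stored as-is, and Z and N are only derived when a branch actually asks for them.

```cpp
#include <cassert>
#include <cstdint>

struct Flags {
  // Holds the last flag-setting result. Any nonzero value with bit 31 clear
  // means "Z clear, N clear", so 1 is used as the initial state.
  uint32_t zn = 1;

  // Setting flags is a single store (the common case).
  void set_zn(uint32_t result) { zn = result; }

  // Reading flags does the work (the rare case).
  bool z() const { return zn == 0; }
  bool n() const { return (zn & 0x80000000u) != 0; }
};

// Small usage example.
inline void flags_demo() {
  Flags f;
  f.set_zn(0);            // e.g. the result of CMP r0, r0
  assert(f.z() && !f.n());
  f.set_zn(0xFFFFFFFFu);  // e.g. the result of 0 - 1
  assert(!f.z() && f.n());
}
```

This works because a single result can never be both zero and negative, so one word is enough to represent every reachable Z/N combination.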

 

I hope this helps in some small way.

Edited by llabnip

3 hours ago, llabnip said:

think the fast music fetchers are used in combo with the sort of direct audio output mode of the TIA?

 

Correct - to play back digital audio AUDC0 is set to 0, then AUDV0 is updated periodically (ideally once per scanline) with the 4-bit digital audio sample.  For speech the samples are just unpacked (2 per byte) and played back, while for music the sample is generated on the fly using waveform addition.  Pitfall 2 uses the same technique for its music.
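The speech case described above can be sketched as follows. The function names are hypothetical, and the nibble order (high nibble first) is an assumption made for this example rather than something stated in the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Unpack 4-bit samples stored two per byte.
// Assumption: the high nibble is played before the low nibble.
std::vector<uint8_t> unpack_samples(const std::vector<uint8_t>& packed) {
  std::vector<uint8_t> out;
  out.reserve(packed.size() * 2);
  for (uint8_t b : packed) {
    out.push_back(b >> 4);
    out.push_back(b & 0x0F);
  }
  return out;
}

// Value to write to AUDV0 on a given scanline, at one sample per scanline.
uint8_t audv0_for_scanline(const std::vector<uint8_t>& packed,
                           std::size_t scanline) {
  assert(scanline / 2 < packed.size());
  const uint8_t b = packed[scanline / 2];
  return (scanline % 2 == 0) ? (b >> 4) : (b & 0x0F);
}
```

Because each value is meant to be held for exactly one scanline, an emulator that samples the output on an unrelated clock will tend to catch some values twice and skip others.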

 

If Quadrun doesn't sound good, then none of these will either.

 

Here's an old discussion that might help:

 

 


Once per scanline ... yeah, that's probably my problem. I'm sampling and it's async to the scanlines so it's going to sound like crap. I might be able to change that.

 

But on the good-news-front: I figured out my bug with CDFJ+ and GORF is now working! Sounds good, looks amazing and is running at the right speed :)

 

I'll have the CDFJ+ driver checked in to the daily build tonight.  Right now I can only support CDFJ+ at 32K/8K RAM. The ROM is no problem (I can handle 512K ROM as I do for some of the banking schemes) but the 8K fast memory is wired into the Thumbulator core and it's not easy to disentangle without a significant loss of speed. 

 

But CDFJ+ at 32K/8K is still pretty good for the old handheld!


4 hours ago, llabnip said:

@Thomas Jentzsch

After profiling a number of ARM-assisted games, this is the set of "fake" instructions I added to the decode logic:

 

      add2_0, add2_1, add2_2, add2_3, add2_4, add2_5, add2_6, add2_7,
      b1_000_neg, b1_000_pos, b1_100_neg, b1_100_pos, b1_200, b1_300, b1_400, b1_500, b1_600, b1_700,
      b1_800, b1_900, b1_a00, b1_b00, b1_c00, b1_d00, b1_e00, b1_f00,
      b2_pos, b2_neg,
      cmp1_0, cmp1_1, cmp1_2, cmp1_3, cmp1_4, cmp1_5, cmp1_6, cmp1_7,
      ldr4_0, ldr4_1, ldr4_2, ldr4_3, ldr4_4, ldr4_5, ldr4_6, ldr4_7,

I found lsl1 to be (slightly) more frequent than cmp1 and ldr4.

4 hours ago, llabnip said:

These were heavy-hitters. Where you see _neg or _pos, that decodes the branch/jump logic as a positive jump or negative jump which allows for very fast processing:

 

rb=(inst & 0xFF); // Positive Jump

rb=(inst | 0xFFFFFF00); // Negative Jump (sign extend)

 

Where you see _0, _1, etc. that just decodes the register logic so avoids the shift and mask. 

How about pre-decoding one extra register for certain instructions? E.g. rb for b1 and b2? That should have more impact than differentiating between 8 add2s.
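One way to picture that suggestion is a decoded entry that carries the already sign-extended displacement next to the op, so the execute loop only has to add it. The struct and function names below are hypothetical; the bit layouts are the standard Thumb encodings for the conditional branch (8-bit offset) and the unconditional branch (11-bit offset).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical decoded entry for a branch instruction.
struct DecodedBranch {
  uint8_t cond;  // condition field (bits 11..8); unused for b2
  int32_t disp;  // sign-extended displacement in bytes
};

// b1: conditional branch, 1101 cccc oooooooo
DecodedBranch predecode_b1(uint16_t inst) {
  assert((inst & 0xF000) == 0xD000);
  int32_t off = inst & 0xFF;
  if (off & 0x80) off -= 0x100;  // sign extend 8 bits
  return DecodedBranch{static_cast<uint8_t>((inst >> 8) & 0xF), off * 2};
}

// b2: unconditional branch, 11100 ooooooooooo
DecodedBranch predecode_b2(uint16_t inst) {
  assert((inst & 0xF800) == 0xE000);
  int32_t off = inst & 0x7FF;
  if (off & 0x400) off -= 0x800;  // sign extend 11 bits
  return DecodedBranch{0, off * 2};
}
```

The trade-off is a wider decoded table (more memory per instruction) against removing the mask and sign extension from the hot loop entirely.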

4 hours ago, llabnip said:

When those are being done a million times per second, it adds up!

For sure! Especially because the usable CPU power of the ARM chip is still ~50x higher than the 6507 can deliver.

4 hours ago, llabnip said:

The other big improvement was elimination of the PC logic to just use a fast fetching pointer for the instruction and decode and avoiding the overhead of calling the function every time:

 

  while (1)
  {
      uInt16 inst = *thumb_ptr++;
      Thumbulator::Op decoded = (Thumbulator::Op)*thumb_decode_ptr++;
      
      switch (decoded)
      {

        ....

 

In the few cases where I need the PC (register 15), I just patch it up:

 

#define FIX_R15_PC reg_sys[15] = ((u32) (thumb_ptr - rom) << 1) + 3;

I checked your code already. Not sure if I will replicate this for Stella now. As of now it is fast enough. But should it become required (e.g. emulating an UNO-cart game, with the CPU clocked 150+ MHz), I will come back to this.

4 hours ago, llabnip said:

Lastly, I combined the ZN flags into one variable since we set those bits orders of magnitude more than we check them...

 

#define do_znflags(x) ZNflags=(x)

 

So then I just need to look for the high-bit for N and any zero value for Z.

Smart. It seems that it might make sense to do the same for C and V.

4 hours ago, llabnip said:

I hope this helps in some small way.

It adds up. 😀:)

Edited by Thomas Jentzsch

14 hours ago, llabnip said:

So just for laughs and to bring this to a logical conclusion... I did try to jam in CDFJ+ support and tried the only game demo ROM I have: Gorf (Demo V3)

 


 

 

Does Gorf use the fast music fetchers?  I see that it uses the offset for LDA fast fetchers (and does not appear to use LDX or LDY fast fetchers - if this is not correct, please let me know and I'll dig in more!).

 

 

Looks good!  FYI Gorf doesn't use the music fetchers and it uses the default offset of 0 for the fast fetchers and doesn't use LDX or LDY.  Gorf Arcade doesn't actually need CDFJ+; I converted it to CDFJ+ because it has the best memory map layout IMO and I wanted to have it ready to expand to 64K if needed.  For some reason it doesn't work on the RetroN 77 (crashes); I know it's at v6.6 and CDFJ+ support is in 6.7, but I thought maybe since it didn't use a FF offset <> 0 and didn't use LDX, LDY it may work. :| It's most likely because of the changes it needed to support 32K ROM/8K RAM. :ponder:  

10 minutes ago, llabnip said:

But on the good-news-front: I figured out my bug with CDFJ+ and GORF is now working! Sounds good, looks amazing and is running at the right speed :)

Awesome! :thumbsup:  

10 minutes ago, llabnip said:

I'll have the CDFJ+ driver checked in to the daily build tonight.  Right now I can only support CDFJ+ at 32K/8K RAM. The ROM is no problem (I can handle 512K ROM as I do for some of the banking schemes) but the 8K fast memory is wired into the Thumbulator core and it's not easy to disentangle without a significant loss of speed. 

Have you tried the Turbo Arcade demo?  It's CDFJ+, 64K ROM / 8K RAM.  The retail version of Qyx is actually CDFJ+, but this uses LDX/LDY FF since I needed it for the status area.  I'll put together an updated demo so you can test that one too.  I'm still trying to figure out the DSi; my son hasn't used it in 9 years and forgot the parental code passcode and the secret question answer lol.

10 minutes ago, llabnip said:

But CDFJ+ at 32K/8K is still pretty good for the old handheld!

Totally amazing - great job! :thumbsup: 


21 hours ago, llabnip said:

Cool! So if it's really a DSi, the SD card slot should be on the right side covered up by a little flap.  The slot on the back is the cartridge port.  

21 hours ago, llabnip said:

If it's an older DS, it will not have the SD slot on the right side.  The only way to make that play games is via a flash cart - they are cheap (an R4i clone cart is like $25) but they will only play the more basic games (4K/8K and some 16K/32K games that aren't too taxing). The DSi has a 2x CPU that is unlocked for homebrews via the SD card.

:lol: yes it's really a DSi, turns out I was holding it backwards :dunce:  

21 hours ago, llabnip said:

Most of us follow the guide here:  https://dsi.cfw.guide/get-started.html#requirements 

 

I use the "Memory Pit" exploit that basically replaces one file on the SD card that is activated when your DSi "takes a picture" - this is the foot in the door that is needed to launch the custom menu.

 

I also use Twilight Menu++ as my menu launcher of choice.

 

With that, it's just a matter of downloading the StellaDS.nds (or any of the other emulators I worked on) and placing your ROMs wherever you want (recommend /roms/2600 as it will start looking there). 
 

Thanks for the link!  I'll dig up an SD card and see what I can come up with. :) 

21 hours ago, llabnip said:

If you struggle, I’ll offer to build you an SD card and ship it free of charge. I owe you at least that for your great ports over the years!

 

Wow - thanks so much for the offer!  If I can't figure this out I'll send you a PM and we'll come up with a plan.  I'd like to get this DSi thing running so I can add it into my test plan when I release new games (and I can go back and optimize some of my old ones too, most notably Lady Bug Arcade).

 

Thanks!

John

 


Checking Stella's changelog for Quadrun:

 

July 17, 2004 - Stella release 1.4: "Digital sound support (used in games like Quadrun and Pitfall2) has been greatly improved. Sound generation is now more tightly synchronized with video updates."

 

October 26, 2012 - Stella release 3.7.3: "Improved sound generation with ROMs that have irregular scanline counts. This fixes many demo ROMs as well as Quadrun, where previously there would be 'gaps' in the sound output.". There's also "Fixed bug in DPC+ bankswitch scheme; the music in several ROMS wasn't playing correctly."

 

February 21, 2013 - Stella release 3.8: "Selecting more common sample rates (other than 31400) now works much better, but there are still a few ROMS (like Quadrun) where 31400Hz still works best."

 

Around 3.7.3 we'd have been working on Stay Frosty 2, so the music is probably better than what StellaDS currently generates. Stella 5.0 had major TIA/6502/RIOT changes, so might make sense to compare StellaDS's TIA audio routines with those in 4.x source.  

 

I noticed 5.0 also had: "Fixed long-standing bug in 3-voice music in DPC+ bankswitching scheme; the music now sounds much more like the real thing." I suspect that bug fix does not depend on the TIA updates.  


8 minutes ago, Thomas Jentzsch said:

@llabnip Where did you get your Thumbulator code from? It looks quite different from the one we based our code on in Stella.

StellaDS was originally based on Stella 1.4, and I updated to 3.7.x (more or less) with some key fixes from later releases.  I still use the old TIA hacks rather than the cycle-accurate TIA. That cycle-accurate TIA comes with a huge performance penalty that the DS can't handle.

