
Any info on StellaDS?


AgentOrange96


1 hour ago, Thomas Jentzsch said:

How about pre-decoding one extra register for certain instructions? E.g. rb for b1 and b2? That should have more impact than differentiating between 8 add2s.

It didn't rise to the > 100k instructions threshold that I used to determine what instructions got split. I'll re-profile to see where those fall. 

@johnnywc I assumed 64K CDFJ+ implied 16K RAM!  So on your suggestion, I tried it. Crashed badly. I think I have plenty of ROM space - but the Thumb driver is strictly limited to 8K of fast RAM (and I don't check for out-of-bounds as part of the speedups).  I'll dig in as it's probably just something goofy on my side.


45 minutes ago, llabnip said:

@johnnywc I assumed 64K CDFJ+ implied 16K RAM!

 

I was working on a CDFJ+ framework that has you specify a few things to configure the project, and it auto-configs the rest....  

 

Found it - last worked on it in March. Wow, time flies.

 

Left is the 6507 source, where you specify things like ROM size, right is the auto-config routines that use values to set the rest.

 

791878365_ScreenShot2022-11-28at1_21_55PM.thumb.png.3400dff81327e7d5f94457f39b372dc5.png

 

 

Based on _PC_ROM_SIZE:

  • 32K ROM -> 8K RAM
  • 64K ROM -> 16K RAM
  • 128K ROM -> 16K RAM
  • 256K ROM -> 32K RAM
  • 512K ROM -> 32K RAM
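The table above can be written as a small lookup. This is only a sketch: the function names `cdfjplus_ram_kb_for_rom_kb` and `cdfjplus_ram_bytes` are hypothetical and do not come from the framework, which does this in the DASM/linker configuration rather than in C.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: RAM size in KB paired with a ROM size in KB,
 * following the table above. Returns 0 for a size not in the table. */
static uint32_t cdfjplus_ram_kb_for_rom_kb(uint32_t rom_kb)
{
    switch (rom_kb) {
    case 32:  return 8;
    case 64:
    case 128: return 16;
    case 256:
    case 512: return 32;
    default:  return 0;   /* not a size listed in the table */
    }
}

/* Hypothetical helper: bytes of emulated RAM an emulator would need
 * to reserve for a ROM of the given size. */
static uint32_t cdfjplus_ram_bytes(uint32_t rom_kb)
{
    uint32_t ram_kb = cdfjplus_ram_kb_for_rom_kb(rom_kb);
    assert(ram_kb != 0 && "ROM size not in the table");
    return ram_kb * 1024u;
}
```

For example, `cdfjplus_ram_kb_for_rom_kb(64)` yields 16, which is why a 64K ROM cannot be assumed to fit in an 8K RAM buffer unless the game is explicitly configured for 8K.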

I haven't worked on it since batari discovered the 70 MHz issue when launching a ROM from the Harmony cart's menu.   We have a solution, which I need to implement and test.  I need to wrap up my Synology Linux virtual machine so I can resume work on it, just got a little sidetracked by StellaDS - LOL


Thanks @SpiceWare

 

Yeah, Turbo needed 16K of emulated RAM to get running. Unfortunately, I only have 8K of the fast working RAM.

 

I swapped out to 32K of slower SRAM (I have plenty of that) and the performance of all ARM games dropped more than 10%.

 

But... Turbo Arcade did run!  Had fun at 50 fps before I revert the changes and live with 32K/8K (well... I'll support 64K/8K and 128K/8K in case that ends up being "a thing" in the future). 

 

image.png.a50a4b72c3bcf68c598a41d157a928a5.png

 

 

 

 


1 hour ago, llabnip said:

 I assumed 64K CDFJ+ implied 16K RAM!  So on your suggestion, I tried it. Crashed badly. I think I have plenty of ROM space - but the Thumb driver is strictly limited to 8K of fast RAM (and I don't check for out-of-bounds as part of the speedups).  I'll dig in as it's probably just something goofy on my side.

I think I may be wrong with my 64K/8K assumption. :P  You may be right (based on Darrell's response below), 64K may mean 16K of RAM.  I modified my Turbo Arcade custom boot to specify just 2K for the C variables (it was at 4K) and the starting address to be 0x40001800.  It compiles and the splash screen works but then it crashes in Stella.  

59 minutes ago, SpiceWare said:

 

I was working on a CDFJ+ framework that has you specify a few things to configure the project, and it auto-configs the rest....  

 

Found it - last worked on it in March. Wow, time flies.

 

Left is the 6507 source, where you specify things like ROM size, right is the auto-config routines that use values to set the rest.

 

791878365_ScreenShot2022-11-28at1_21_55PM.thumb.png.3400dff81327e7d5f94457f39b372dc5.png

 

 

Based on _PC_ROM_SIZE:

  • 32K ROM -> 8K RAM
  • 64K ROM -> 16K RAM
  • 128K ROM -> 16K RAM
  • 256K ROM -> 32K RAM
  • 512K ROM -> 32K RAM

 

Thanks for the reminder Darrell!  Do you know if you *have* to reserve 16K RAM when using 64K ROM?  Right now it looks like my memory layout reserves 2K of RAM for the CDFJ driver, 10K for DDRAM and 4K for C variables:

 

  • 0x40000000 - 0x400007FF : 2K (CDFJ+ driver)
  • 0x40000800 - 0x40002FFF : 10K (DDRAM)
  • 0x40003000 - 0x40003FFF : 4K (C RAM)

C_STACK is initialized to 0x40003FDC

 

If I can reduce my DDRAM/C usage to < 6K, could I adjust my C_STACK and custom_boot.lds to specify just 8K needed for RAM?  I tried to test over here but since I had 10K of DDRAM I was lazy when I threw Turbo together and didn't optimize space usage so it's using more than 4K of DDRAM, but I'm sure I could reduce that.

 

Also, I noticed in my Turbo source code it's using this to display the free DDRAM:

 

    ; Show Remaining Display Data
    IF (ROM_SIZE == 32)
    echo "---- DISPLAY DATA", ($1800 - *)d, "bytes free"
    ENDIF
    IF (ROM_SIZE == 64 || ROM_SIZE == 128)
    echo "---- DISPLAY DATA", ($3800 - *)d, "bytes free"
    ENDIF
    IF (ROM_SIZE == 256 || ROM_SIZE == 512)
    echo "---- DISPLAY DATA", ($7800 - *)d, "bytes free"
    ENDIF  

 

I'm not sure where I got this, but I think it is wrong, at least for ROM_SIZE = 32K.  Shouldn't this be ($1000 - *)d to reserve 4K for DDRAM, 2K for the driver and 2K for C variables?  For 64K it seems that it's reserving 14K for DDRAM, 2K for the driver and 16K for C variables, but my .lds file has this:

 

/* Memory Areas  */
MEMORY
{
    boot (RX)   : ORIGIN = 0x2AD0    , LENGTH = 0x50    /* C-runtime booter */
    C_code (RX) : ORIGIN = 0x2B30    , LENGTH = 0xECD0  /* C code (50K) */
    ram         : ORIGIN = 0x40003000, LENGTH = 0x1000  /* 4K variables */
}

I think the ram ORIGIN should be 0x40004000, not 0x40003000, correct?  This would start at 16K, and the LENGTH should be 0x4000 for 16K of C ram?  Anyway, to stay on topic, my question was if I reduce my DDRAM usage to 4K and my C ram usage to 2K in Turbo and update the lds to match (ORIGIN=0x40001800, LENGTH=0x800) as well as the stack pointer to 0x400017FC, would it be able to run under StellaDS since it would be using the same memory space as a 32K CDFJ(+) game? :ponder: 
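One way to sanity-check a proposed layout is to write the regions down and test where each address lands. The sketch below only does the arithmetic for the numbers quoted in this post; the type `region_t` and the helper names are hypothetical, and nothing here is taken from the CDFJ+ driver or from StellaDS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One contiguous RAM region covering [base, base + size). */
typedef struct {
    uint32_t base;
    uint32_t size;
} region_t;

/* True if addr lies inside the region. */
static bool region_contains(region_t r, uint32_t addr)
{
    return addr >= r.base && (addr - r.base) < r.size;
}

/* Address of the last byte in the region. */
static uint32_t region_last(region_t r)
{
    assert(r.size > 0);
    return r.base + r.size - 1u;
}

/* Current 16K layout: only the C RAM region is needed for the check. */
static const region_t c_ram16 = { 0x40003000u, 0x1000u };

/* Proposed 8K layout: 2K driver, 4K DDRAM, 2K C RAM. */
static const region_t driver8 = { 0x40000000u, 0x0800u };
static const region_t ddram8  = { 0x40000800u, 0x1000u };
static const region_t c_ram8  = { 0x40001800u, 0x0800u };
```

With these numbers, the current C_STACK value 0x40003FDC lies inside the 16K layout's C RAM, 0x24 bytes below its end. In the proposed 8K layout, C RAM spans 0x40001800 to 0x40001FFF, so 0x400017FC falls in the DDRAM region rather than in C RAM. By analogy with the 16K layout, a stack top of 0x40001FDC would sit in the same relative position. That is an inference from the arithmetic only and has not been tested here.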

 


8 minutes ago, johnnywc said:

  Anyway, to stay on topic, my question was if I reduce my DDRAM usage to 4K and my C ram usage to 2K in Turbo and update the lds to match (ORIGIN=0x40001800, LENGTH=0x800) as well as the stack pointer to 0x400017FC, would it be able to run under StellaDS since it would be using the same memory space as a 32K CDFJ(+) game? :ponder: 

 

Don't worry about staying on topic... I Necro-bumped this thread and usurped it :) 

 

Obviously I'd be keenly interested in anything that would run on 64K/8K combo!!

 

Edit: but do NOT do extra work for StellaDS. There are about 100 users... it's a niche of a niche. But still.. would be cool!

Edited by llabnip

12 minutes ago, johnnywc said:

Do you know if you *have* to reserve 16K RAM when using 64K ROM?

 

I don't think you have to use it, you just need to configure everything correctly for 8K.  My make process auto-creates symbols_from_dasm_for_lds.h, with the RAM size and other values, which is used by custom.boot.lds. Some of that config is also in the 6507 code, like that C_STACK (which is assigned in that Configure RAM code in the prior screenshot) in the bottom-left split. There might be more to the configuration, but I don't recall as it's been a while since I last looked at this.

 

image.thumb.png.2bf3f09b6caa272a58a4ce8459bb6333.png

 


25 minutes ago, johnnywc said:

Also, I noticed in my Turbo source code it's using this to display the free DDRAM:

 

My framework uses this, the _PC_DD_SIZE is set by the programmer alongside _PC_ROM_SIZE:

 

    SEG.U DISPLAYDATA
    ORG $0000

_DS_TO_ARM:     
_RUN_FUNC:  ds 1        ; function to run
_SWCHA:     ds 1        ; joystick directions to ARM code
_SWCHB:     ds 1        ; console switches to ARM code
_INPT4:     ds 1        ; left firebutton state to ARM code
_INPT5:     ds 1        ; right firebutton state to ARM code

<snip>

    echo "----",(_PC_DD_SIZE - .) , "bytes of Display Data RAM left"    

 


33 minutes ago, llabnip said:

You've lost me. Then again, I'm easy to lose :)

 

In the meantime... 6.0b daily build is checked in with the first support for CDFJ+ up to 64K ROM and 8K RAM.

Your answer to the text you quoted made no sense to me. Maybe I am the one who is lost. 😀


4 hours ago, llabnip said:

Don't worry about staying on topic... I Necro-bumped this thread and usurped it :) 

:) 

4 hours ago, llabnip said:

Obviously I'd be keenly interested in anything that would run on 64K/8K combo!!

 

Edit: but do NOT do extra work for StellaDS. There are about 100 users... it's a niche of a niche. But still.. would be cool!

No worries, I would like to get my games to run on as many platforms as possible, as long as it's not too much effort. ;)   I did spend about 30 minutes trying to set the configuration to only use 8K max and it kind of worked : splash screen, title screen, but crash on the game screen in Stella with out of range reads.  Interestingly, it played in Gopher2600 much nicer, the game actually played but there were bugs where the same enemy car would show up over and over and the scene would never end.  I suspect memory is getting clobbered at some point.  For now we can say we support Turbo Arcade at a little bit slower speed. ;)  


17 minutes ago, johnnywc said:

For now we can say we support Turbo Arcade at a little bit slower speed. ;)  

Yeah, no worries for sure. I am working on some experiments that might help. I only have 16K of fast working RAM and 8K of that is a general purpose buffer I use for a ton of stuff... basically I can run an entire 2K/4K/8K game from that buffer as well as Starpath AR games. I also use it for the RAM on ARM-Assisted games to gain speed.  The normal Nintendo DS SRAM is relatively huge (4MB) but is slow... and worse, they connected it to the 32-bit ARM via a 16-bit bus and it introduces wait states. So the fast RAM is where the action is... and to speed things up further, I rely on this fast RAM being fixed at a specific location so it's not trivial to swap out for one game.

 

But...

 

There is also Video VRAM which is connected to a 16-bit bus but is much faster than the larger SRAM.  And the DS lets you re-purpose the VRAM for general CPU access. I have about 128K of VRAM left to play with. I'm going to try using it for the ARM Thumb RAM.  It won't be as fast as the 8K working RAM. So many of these ARM-assisted games are running right at the margins (60 fps plus or minus a few frames) and any loss is significant.  However, if the loss is only a frame or two of performance, I'll live with VRAM and scour the earth to get back those 2 lost frames of speed... that way I should be able to support 512K ROM and 128K RAM (yes, I realize those are theoretically beyond the current technical specs Spice has in his overview of the scheme... but at least I'll be prepared for the future!).

 

I still can't get Lady Bug Arcade above about 54 fps. It's really crushing the emulation!

 

 


7 minutes ago, llabnip said:

Yeah, no worries for sure. I am working on some experiments that might help. I only have 16K of fast working RAM and 8K of that is a general purpose buffer I use for a ton of stuff... basically I can run an entire 2K/4K/8K game from that buffer as well as Starpath AR games. I also use it for the RAM on ARM-Assisted games to gain speed.  The normal Nintendo DS SRAM is relatively huge (4MB) but is slow... and worse, they connected it to the 32-bit ARM via a 16-bit bus and it introduces wait states. So the fast RAM is where the action is... and to speed things up further, I rely on this fast RAM being fixed at a specific location so it's not trivial to swap out for one game.

 

But...

 

There is also Video VRAM which is connected to a 16-bit bus but is much faster than the larger SRAM.  And the DS lets you re-purpose the VRAM for general CPU access. I have about 128K of VRAM left to play with. I'm going to try using it for the ARM Thumb RAM.  It won't be as fast as the 8K working RAM. So many of these ARM-assisted games are running right at the margins (60 fps plus or minus a few frames) and any loss is significant.  However, if the loss is only a frame or two of performance, I'll live with VRAM and scour the earth to get back those 2 lost frames of speed... that way I should be able to support 512K ROM and 128K RAM (yes, I realize those are theoretically beyond the current technical specs Spice has in his overview of the scheme... but at least I'll be prepared for the future!).

Sounds like a great plan!  I'll put the Turbo Arcade 8K task aside and focus on...

7 minutes ago, llabnip said:

I still can't get Lady Bug Arcade above about 54 fps. It's really crushing the emulation!

 

 

... getting LB:A to run a bit faster. :)  

 

I'm confused though; I think you said Gorf Arcade runs 60 fps, correct? 

 

Taking a quick snapshot in-game to display the cycles/instructions for Lady Bug Arcade shows ~64K cycles in VBLANK, ~3.6K in overscan, and ~29K instructions in VB and ~1,600 in OS:

image.thumb.png.d31878cd33f0a77975f16afa7d266a90.png

 

Gorf Arcade shows a bit higher for each (~68K and ~3.8K cycles; ~31K and ~1,700 instructions):

image.thumb.png.791dd9d6c5f892073cbb5e3a9959df8e.png

 

Are there any specifics why LB:A is running so slow/what's getting hammered?  I'll take a look at the code and see if there's anything obvious, but this should be a fairly simple game to run vs. RobotWar, Mappy, etc. although I know I could do optimizations on the maze rendering if that's the culprit.

 

Thanks!

John

 


Disregard.  I did a quick dive into LB:A and the issue is that I render the maze every frame when I only have to render it once at the start of the level. If I comment out that call the # of cycles drops from 64K down to 40K. :o  I am going to spend some time modifying the code so it only renders the static part of the maze once at the start of the level and that should fix all of our problems (and also make the game run much better on the Retron77).  


On 11/25/2022 at 11:57 PM, llabnip said:
  • Not calling into the execute() for each Thumb instruction - the overhead of the call was not optimized away with GCC at the max settings so I moved the handling of the Thumb loop to inside the execute().

:idea: You can force inlining with "__attribute__((always_inline))". Unlike "inline" which is only a hint, this is respected by the compiler.

 

That should help avoid spaghetti code.
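A minimal sketch of the attribute in use with GCC or Clang. The macro `FORCE_INLINE` and the functions `step_add` and `run_loop` are made-up stand-ins for a per-instruction handler and its caller. They are not code from StellaDS or the Thumbulator.

```c
#include <stdint.h>

/* GCC/Clang: request inlining even where the optimizer's heuristics
 * would otherwise leave a real call in place. */
#define FORCE_INLINE static inline __attribute__((always_inline))

/* Hypothetical per-instruction step: adds two register values. */
FORCE_INLINE uint32_t step_add(uint32_t a, uint32_t b)
{
    return a + b;
}

/* Hypothetical run loop: the body of step_add is expanded here,
 * so the loop pays no call overhead per step. */
static uint32_t run_loop(uint32_t start, uint32_t count)
{
    uint32_t acc = start;
    for (uint32_t i = 0; i < count; i++)
        acc = step_add(acc, 1u);
    return acc;
}
```

This keeps each handler as its own readable function while the generated code looks as if the handler body had been pasted into the loop.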


9 hours ago, johnnywc said:

If I comment out that call the # of cycles drops from 64K down to 40K. :o 

That's huge! And will put LBA on the fully playable list for StellaDS :)

 

I did my experiments and the best combo I found was this:

 

If the ARM-Assisted code is 32K or less, use the 8K fast RAM buffer

If the ARM-Assisted code is > 32K, use a slower 32K RAM buffer (I didn't see much sense in having a 16K vs 32K possibility... if the upper 16K goes unused, that's okay)

 

All RAM is now handled through indirection - which does have a slight performance penalty but it was less than one frame across the board. I can live with that.
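The buffer selection and the indirection could look roughly like the sketch below. All names are hypothetical, the two arrays merely stand in for the fast and slow memory areas on the DS, and the address masking is an assumption added to keep the example safe. It is not a description of the actual StellaDS code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FAST_RAM_SIZE (8u * 1024u)
#define SLOW_RAM_SIZE (32u * 1024u)

static uint8_t fast_ram[FAST_RAM_SIZE];  /* stand-in for the 8K fast buffer    */
static uint8_t slow_ram[SLOW_RAM_SIZE];  /* stand-in for the slower 32K buffer */

static uint8_t *arm_ram;       /* every RAM access goes through this pointer */
static uint32_t arm_ram_mask;  /* size - 1; both sizes are powers of two     */

/* Pick the buffer once, when the game is loaded. */
static void select_arm_ram(size_t rom_bytes)
{
    if (rom_bytes <= 32u * 1024u) {
        arm_ram = fast_ram;
        arm_ram_mask = FAST_RAM_SIZE - 1u;
    } else {
        arm_ram = slow_ram;
        arm_ram_mask = SLOW_RAM_SIZE - 1u;
    }
}

/* Read one byte of emulated ARM RAM (0x40000000-based address). */
static uint8_t arm_ram_read8(uint32_t addr)
{
    assert(arm_ram != NULL);
    return arm_ram[addr & arm_ram_mask];
}

/* Write one byte of emulated ARM RAM. */
static void arm_ram_write8(uint32_t addr, uint8_t value)
{
    assert(arm_ram != NULL);
    arm_ram[addr & arm_ram_mask] = value;
}
```

The cost of the indirection is the extra pointer load on each access, which fits the reported slowdown of less than one frame.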

 

I had to give up the 512K ROM option to get this in due to some other constraints - but that's hardly a loss.  I should now be able to support ARM-Assisted games with ROM up to 256K and RAM up to 32K.  That should hold the line  for a while.

 

With that, here is where I stand with ARM-Assisted games:

 

  • All DPC+ games run at 60 fps except Scramble which will occasionally dip down to 57+ but it's not noticeable for gameplay
  • All CDF/CDFJ/CDFJ+ games run at 60 fps except:
    • Turbo Arcade (50 fps but somehow still fun... needs about 10% more speed in my driver)
    • Boom (RC2) (50 fps)
    • Lady Bug Arcade (54 but soon 60!)
    • Draconian (55 with dips and gusts as the enemies swarm or leave)
    • Super Cobra (57 with gusts to 60 and fully playable)
    • Robot War dips below 60 occasionally... though I'm balls at the game so it might dip down more when the robots REALLY start swarming

 

I'm happy enough with my progress that I'm treating myself to the full Wizard of Wor Arcade game ROM... paypal sent!

 

3 hours ago, Thomas Jentzsch said:

That should help avoid spaghetti code.

I'm Italian... I like spaghetti! :) 


20 hours ago, SpiceWare said:

 

Correct - to play back digital audio AUDC0 is set to 0, then AUDV0 is updated periodically (ideally once per scanline) with the 4-bit digital audio sample.

So I started this process with coffee this morning. I think I know what to do... only I don't want to do it :)  

 

There are two main sound engines for DS/DSi development. The one being used in StellaDS is the "standard" one that is ... quirky.  I switched to MAXMOD for some of my more recent emulators on the DS/DSi and it's much nicer - better sounding and even a bit less overhead. But it's a bit of work to replace the guts of the sound engine and I'm not yet ready to tackle it. MAXMOD will give me the flexibility to process sound on each scanline. Soon. Maybe.

 

One question - how do I easily track scanlines? I don't call into the TIA processing on scanlines whose clock does not render the line (e.g. in vertical blank before the line would be visible).  I added a counter in the Tia::poke(0x02) where the caller is asking to wait for the leading edge of HBLANK. That generated ~15K hits per second which I think is roughly right for NTSC... But I don't know if that's reliable (i.e., if someone used up all cycles on a scanline, they wouldn't have to wait for HBLANK, right?).  

 

Is there a foolproof and easy method of knowing when we get to a new (or end of) scanline?

Edited by llabnip

13 minutes ago, llabnip said:

Is there a foolproof and easy method of knowing when we get to a new (or end of) scanline?

Track cycles:  76 CPU cycles of the 6507 run per scanline.... a STA WSYNC instruction just halts the CPU for the rest of a scanline.  The only tricky thing to handle is when a CPU instruction crosses into the next scanline.

 

For video:

Yes, NTSC has a 15.7 kHz line frequency.

Also, there are 228 color cycles in a scanline (68 hblank + 160 on screen).
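The cycle-tracking approach can be sketched as below. The two figures agree with each other: 228 color cycles at 3 color clocks per CPU cycle is 76 CPU cycles per line. The type `line_tracker_t` and both functions are hypothetical, and the sketch does not model exact WSYNC timing edge cases, such as a strobe whose instruction ends exactly on a line boundary.

```c
#include <stdint.h>

#define CPU_CYCLES_PER_SCANLINE 76u

typedef struct {
    uint32_t cycle_in_line;  /* CPU cycles used so far in this line, 0..75 */
    uint32_t scanline;       /* scanlines completed since reset            */
} line_tracker_t;

/* Account for one executed instruction. Returns how many scanline
 * boundaries it crossed; the overshoot carries into the new line. */
static uint32_t tracker_add_cycles(line_tracker_t *t, uint32_t cycles)
{
    uint32_t crossed = 0;
    t->cycle_in_line += cycles;
    while (t->cycle_in_line >= CPU_CYCLES_PER_SCANLINE) {
        t->cycle_in_line -= CPU_CYCLES_PER_SCANLINE;
        t->scanline++;
        crossed++;
    }
    return crossed;
}

/* STA WSYNC: the CPU is halted for the rest of the current line,
 * so the next instruction starts at cycle 0 of the following line. */
static void tracker_wsync(line_tracker_t *t)
{
    t->cycle_in_line = 0;
    t->scanline++;
}
```

A per-scanline hook, such as sampling AUDV0 for digital audio, would run once for each boundary reported by `tracker_add_cycles` and once per `tracker_wsync`. It then fires whether or not the game strobes WSYNC on that line.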

 

 


6 hours ago, llabnip said:

That's huge! And will put LBA on the fully playable list for StellaDS :)

 

Well, I wasn't able to have the screen rendered just once without a major rewrite, but I did decouple the maze data decompression and rendering so the maze data is just decompressed once (and stored in a buffer) instead of every frame.  This reduced the average # of cycles in Stella per frame from ~64K to about 55K, so about a 14% reduction.  I tried it on the R77 and it runs much smoother now; hopefully you see some improvement on StellaDS also! 🤞

 

Lady-Bug-Arcade_demo_final_v2_NTSC.bin

7 hours ago, llabnip said:

I'm happy enough with my progress that I'm treating myself to the full Wizard of Wor Arcade game ROM... paypal sent!

Wow - thanks so much!  I'll be sure to process that order right away. :D 

 

 


This kind of flew under my radar so this is the first time I've tried it out. Pretty awesome!   I've been digging the FPGA 2600 core on the Analogue Pocket as of late, but I really dig the touchscreen interface in here for the console switches, paddle control, game info etc. 

 

'Course I don't have any of the newer ROMs you guys have.. so I'm playing Atari Pacman :lol:  and the other usual VCS fare. It plays them great  :D

 

image1.thumb.jpg.61e8b86dced3c54346bbcefd3ef1ec0c.jpg

 

Edited by NE146

Nice @NE146!   Hard to tell as I don't know how big an Analogue Pocket is... but that looks like a DSi XL (deep crimson or wine colored - quite lovely). That's the preferred handheld for StellaDS (or a 2DS or 3DS) - the XL screen is a bit slower than the original DSi LCD and so the pixels fade a bit more slowly and that almost replicates the phosphor fade of a TV. 

 

If you're running from a flash cart (back slot), you will be running in DS-mode which is only 67MHz CPU and that's fine for most classic games. If you're running from the SD card (side slot) via something like Unlaunch or Twilight Menu++, then you will have unlocked the 2X CPU (and 4X RAM - though that's less critical) and you should be able to run the ARM-Assisted games and the larger/complex homebrews of the past decade. Be sure to check out John's Champ Games website for some awesome demos (which are really quite generous in not having too many limitations). 

 

I just checked in daily build 6.0e with another frame or two of performance on ARM-Assisted games. I think the Thumbulator is just about as fast as I can get it. I'm down to looking at mixed C/Assembly to try and root out any unnecessary instructions. 


9 minutes ago, llabnip said:

If you're running from a flash cart (back slot), you will be running in DS-mode which is only 67MHz CPU and that's fine for most classic games. If you're running from the SD card (side slot) via something like Unlaunch or Twilight Menu++, then you will have unlocked the 2X CPU (and 4X RAM - though that's less critical) and you should be able to run the ARM-Assisted games and the larger/complex homebrews of the past decade. Be sure to check out John's Champ Games website for some awesome demos (which are really quite generous in not having too many limitations). 

 

 

Interesting! Who knew that stuff made a difference (I certainly didn't :lol: )  

 

Yes it's a DSi XL, and launching via SD card. I definitely would have preferred to run on the New 3DS XL (I pretty much have every DS model for the most part) but it's always a hassle for me to figure out how to get .nds files on there. 😛 So I stick to 3DS games on the 3DSs, and DS games on the DSs/DSis whether it's jailbreak or flashcart. 

Edited by NE146

20 minutes ago, llabnip said:

I just checked in daily build 6.0e with another frame or two of performance on ARM-Assisted games.

Did you try PGO builds? For Stella (MSVC), PGO builds result in ~35% extra FPS for ARM games (and a 20% smaller executable). I wouldn't expect that much here, but even 10% would be valuable.

20 minutes ago, llabnip said:

I think the Thumbulator is just about as fast as I can get it. I'm down to looking at mixed C/Assembly to try and root out any unnecessary instructions. 

Have you checked my suggestion of predecoding this for branches?

rb=(inst>>0)&0xFF;
if(rb&0x80) rb|=0xFFFFFF00; // Sign extend

//plus: 
rb++;
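As a sketch of what that predecoding could look like, the function below computes the sign-extended, incremented offset once per instruction so it can be stored alongside the decoded opcode. The name `predecode_branch_rb` is hypothetical. The subtraction form of the sign extension is equivalent to the OR-mask form above but avoids relying on unsigned-to-signed conversion. How the execute loop then scales the value and applies it to the PC is not shown in the thread, so it is left out here.

```c
#include <stdint.h>

/* Hypothetical predecode step for a Thumb branch whose low 8 bits
 * hold a signed offset: sign extend it and add one, as in the
 * snippet above, so the execute loop can skip this work. */
static int32_t predecode_branch_rb(uint16_t inst)
{
    int32_t rb = (int32_t)(inst & 0xFFu);
    if (rb & 0x80)
        rb -= 0x100;   /* sign extend the 8-bit field */
    return rb + 1;     /* the "plus" step from the snippet */
}
```

For an instruction word such as 0xD0FE, the low byte 0xFE sign-extends to -2, so the stored value is -1.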

 

Edited by Thomas Jentzsch