
Any info on StellaDS?


AgentOrange96


19 minutes ago, Thomas Jentzsch said:

Did you try PGO builds? For Stella (MSVC), PGO builds result in ~35% extra frames (and a 20% smaller executable). I wouldn't expect that much here, but even 10% would be valuable.

It wouldn't build. I'm right up against my memory limits on the old handheld... the linker threw about 1.7 million errors (approximated - I didn't count).

 

19 minutes ago, Thomas Jentzsch said:

Have you checked my suggestion of predecoding this for branches?

rb=(inst>>0)&0xFF;
if(rb&0x80) rb|=0xFFFFFF00; // Sign extend

//plus: 
rb++;

I pre-decode all branch instructions that are hit with any frequency... positive and negative.

 

          case Op::b1_000_pos:  //B(1) conditional branch
                if (!ZNflags)
                {
                    rb=(inst & 0xFF)+1;
                    thumb_ptr += (int)rb;
                    thumb_decode_ptr += (int)rb;
                }
              break;
              
          case Op::b1_000_neg:  //B(1) conditional branch
                if (!ZNflags)
                {
                    rb=(inst | 0xFFFFFF00)+1;
                    thumb_ptr += (int)rb;
                    thumb_decode_ptr += (int)rb;
                }
              break;
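
A minimal sketch of why the split into separate _pos and _neg opcodes is equivalent to the generic "mask, sign extend, add 1" form quoted above. The function names are hypothetical and not taken from StellaDS or Stella; the offsets are in 16-bit instruction units, matching the pointer arithmetic in the snippet.

```cpp
#include <cstdint>

// Generic form: mask the low 8 bits, sign extend, then add 1.
inline int32_t branchOffsetGeneric(uint16_t inst)
{
    uint32_t rb = inst & 0xFFu;
    if (rb & 0x80u) rb |= 0xFFFFFF00u;   // sign extend
    return static_cast<int32_t>(rb + 1);
}

// Pre-decoded "positive" variant: the sign bit is known to be clear,
// so a plain mask is enough.
inline int32_t branchOffsetPos(uint16_t inst)
{
    return static_cast<int32_t>((inst & 0xFFu) + 1);
}

// Pre-decoded "negative" variant: the sign bit is known to be set,
// so OR-ing in the upper 24 bits both sign extends and wipes the opcode bits.
inline int32_t branchOffsetNeg(uint16_t inst)
{
    return static_cast<int32_t>((inst | 0xFFFFFF00u) + 1);
}

// Decided once at decode time, not on every execution.
inline bool isNegativeBranch(uint16_t inst)
{
    return (inst & 0x80u) != 0;
}
```

The sign test moves out of the hot loop and into the one-time decode pass, which is the whole point of having two opcodes.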



I did try the experiment to change the actual instruction so I don't have to do the AND or the +1 (that is: just do that on the pre-decode) but it requires a further third buffer since the memory is not all ARM instructions (some of it is data and some is 6502 assembly) - but we decode everything. So I did make a copy of the ROM for decode + alter the instruction to eliminate the AND but that didn't help as much as I'd hoped so I backed off it as I really didn't want to waste the space.

 

Edit: one other thing I found to make a nice speedup was knowing if I was reading from RAM or ROM. For example, any read based on reg 13 (SP) is always going to be RAM. So having a RAM-only read or write that can just mask-and-go is a speedup. Even better for ROM access as no masking needs to be done.  Those were happening on the order of 1 million per second.
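
A rough sketch of the RAM-only / ROM-only read idea, under stated assumptions: the sizes, the `Memory` struct and the function names are all hypothetical, ROM is assumed to start at address 0 and RAM at 0x40000000, and callers are assumed to pass in-range, word-aligned addresses.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical sizes, for illustration only.
constexpr uint32_t ROM_SIZE = 32 * 1024;
constexpr uint32_t RAM_SIZE = 8 * 1024;
constexpr uint32_t RAM_BASE = 0x40000000u;
constexpr uint32_t RAM_WORD_MASK = (RAM_SIZE - 1) & ~3u;

struct Memory {
    uint8_t rom[ROM_SIZE] = {};
    uint8_t ram[RAM_SIZE] = {};

    // RAM-only path (e.g. SP-relative loads): mask and go.
    uint32_t read32Ram(uint32_t addr) const {
        uint32_t v;
        std::memcpy(&v, &ram[addr & RAM_WORD_MASK], sizeof v);
        return v;
    }

    // ROM-only path: the address is used as-is, no mask at all.
    uint32_t read32Rom(uint32_t addr) const {
        uint32_t v;
        std::memcpy(&v, &rom[addr], sizeof v);
        return v;
    }

    // General path: has to work out the region on every access.
    uint32_t read32(uint32_t addr) const {
        return (addr >= RAM_BASE) ? read32Ram(addr) : read32Rom(addr);
    }
};
```

An instruction handler that already knows its base register is SP can call `read32Ram` directly and skip the region test entirely.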

 

I also tried to have a memory pointer that is 0x40000000 lower than the base RAM address so I didn't even have to mask for RAM access and that works for all games except Draconian which must have some strange way of accessing RAM (MASKING works for that game but just an offset of 0x40000000 does not - it produced strange colors). So I backed off that change.

Edited by llabnip

3 minutes ago, llabnip said:

It wouldn't build. I'm right up against my memory limits on the old handheld... the linker threw about 1.7 million errors (approximated - I didn't count).

Too bad. I suppose you cannot cross compile on a different platform using PGO?

3 minutes ago, llabnip said:

I did try the experiment to change the actual instruction so I don't have to do the AND or the +1 (that is: just do that on the pre-decode) but it requires a further third buffer since the memory is not all ARM instructions (some of it is data and some is 6502 assembly) - but we decode everything. So I did make a copy of the ROM for decode + alter the instruction to eliminate the AND but that didn't help as much as I'd hoped so I backed off it as I really didn't want to waste the space.

How about changing the cart buffer instead (or too)? 

  1. Change the opcode in the cart buffer (saves RAM)
  2. Change the instruction so that it can be processed faster

If you keep the decoded buffer, you can even use the now unused space of the cart buffer for modifying the code.


12 minutes ago, Thomas Jentzsch said:

How about changing the cart buffer instead (or too)? 

  1. Change the opcode in the cart buffer (saves RAM)
  2. Change the instruction so that it can be processed faster

If you keep the decoded buffer, you can even use the now unused space of the cart buffer for modifying the code.

The cart buffer contains a mix of stuff that is instructions and non-instructions. But the decode runs across all 16-bit "opcodes" (meaning it will be decoding some 6502 crap as Thumb instructions and some data as well).  That's why I needed the third buffer that was only an ARM instruction buffer...  and it worked, saving some AND and the occasional +1 but it didn't have a huge impact (I got about 0.5 additional frames of performance on the more stubborn ARM games for that effort) and in the end I abandoned further efforts.

 

I've had a bit more luck in the TIA processing area with rendering colors and collisions (or lack of collisions for the CDFJ games - none of which seem to use the TIA collision handling). I may spend some more time in there looking for cycles.


3 minutes ago, llabnip said:

The cart buffer contains a mix of stuff that is instructions and non-instructions. But the decode runs across all 16-bit "opcodes" (meaning it will be decoding some 6502 crap as Thumb instructions and some data as well). 

I know. But you could identify ARM code at runtime, decoding once, on-the-fly. And then you can change all the 16 bits. And you only need one bit for marking code as decoded. Though that will probably be too slow for checking; then the current 8 bits will do and the unused 7 bits could be utilized too.

3 minutes ago, llabnip said:

or lack of collisions for the CDFJ games - none of which seem to use the TIA collision handling.

I would not rely on that. But you could decide per game.


5 minutes ago, Thomas Jentzsch said:

I know. But you could identify ARM code at runtime, decoding once, on-the-fly. And then you can change all the 16 bits. And you only need one bit for marking code as decoded. Though that will probably be too slow for checking; then the current 8 bits will do and the unused 7 bits could be utilized too.

I would not rely on that. 

Ah... ok. You're on another level from me... I mostly just try things and see what works :) 

 

All DPC+ games are now full speed (even Space Rocks Tourney will hold 60 most of the time).

The CDFJ/CDFJ+ games are mostly full speed... with some holdouts that I'm working through. But not bad for a little 134MHz ARM processor!

 

5 minutes ago, Thomas Jentzsch said:

But you could decide per game.

The default for DPC+ games is to use collision except for Space Rocks and Scramble... but it can be configured OFF on a per-game basis.

The default for CDFJ/CDFJ+ games is to not use collision... but it can be configured ON on a per-game basis.
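
The defaults described above could be sketched roughly like this. Everything here is hypothetical: the `CartType` enum, the `GameConfig` struct and matching games by title are illustrative stand-ins, and the default for other cartridge types is an assumption.

```cpp
#include <string>

enum class CartType { DPCPlus, CDFJ, CDFJPlus, Other };

struct GameConfig {
    bool tiaCollisions;   // can still be flipped per game by the user
};

// Returns the starting configuration before any per-game override is applied.
GameConfig defaultConfig(CartType type, const std::string& title)
{
    switch (type) {
    case CartType::DPCPlus:
        // Collisions on, except for the two named exceptions.
        return { title != "Space Rocks" && title != "Scramble" };
    case CartType::CDFJ:
    case CartType::CDFJPlus:
        // Collisions off by default.
        return { false };
    default:
        // Assumption: everything else keeps collision handling on.
        return { true };
    }
}
```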

Edited by llabnip

34 minutes ago, llabnip said:

Ah... ok. You're on another level from me... I mostly just try things and see what works :) 

Maybe, but only slightly. And I am sure you can easily get there too.

34 minutes ago, llabnip said:

 

All DPC+ games are now full speed (even Space Rocks Tourney will hold 60 most of the time).

The CDFJ/CDFJ+ games are mostly full speed... with some holdouts that I'm working through. But not bad for a little 134MHz ARM processor!

That's not "not bad", but totally impressive! :thumbsup: (and now you only need a little further step to get everything going at full speed)

34 minutes ago, llabnip said:

The default for DPC+ games is to use collision except for Space Rocks and Scramble... but it can be configured OFF on a per-game basis.

The default for CDFJ/CDFJ+ games is to not use collision... but it can be configured ON on a per-game basis.

Perfect. :) 


After playing a number of amazing games from @johnnywc ... the 192 vertical pixel limitation of the DS is starting to show.  Many classic games are 192... or maybe 200 or 205 with a lot of "sky" or "ground" that don't really matter if they get cut-off (for example, on ancient TVs that don't handle a lot of overscan). 

 

But Champ Games tends to utilize 208 pixel lines which means there are 16 pixel lines that get cut off. 

 

StellaDS has a per-game offset that can be adjusted so these pixel lines can be missing from the top or bottom or a combination... and there is a screen scaling option but that just means that the DS hardware will decide what lines to drop when rendering a frame.  That's great for 98.3% of the games - but some of the newer homebrews really push the visuals and it's difficult to just cut things off. Often I'm finding I have to leave off the entire score display - it doesn't really matter for gameplay but it's nice to know where you are when you're playing!  The alternative is to show your score but cut off the bottom where the ship/lives/fuel/status is.

Tonight I experimented with a new idea that allows the user to assign any of the DS buttons as a scroll/offset button. This allows you to tap a button and the screen will scroll up/down 16 pixels and, after a half-second, smoothly scroll back to your normal position.  So far this has been great! It allows me to see the entire playfield for all modern games and when there is a brief pause in the action, I can tap a button to glance up at the score. It's taken almost no time to mentally get used to this and it should be a great option to render these amazing new games as best possible on the smaller handheld.
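
One way the tap-to-peek behaviour could be sketched is as a tiny per-frame state machine. This is a sketch under assumptions, not the StellaDS code: the `PeekScroll` name, the 60 frames per second rate (so 30 frames for the half-second hold) and the one-pixel-per-frame return speed are all made up for illustration.

```cpp
// Hypothetical helper: tracks a temporary vertical offset after a button tap.
struct PeekScroll {
    static constexpr int kPeekPixels = 16;   // how far the screen jumps
    static constexpr int kHoldFrames = 30;   // about half a second at 60 fps

    int offset = 0;   // extra scroll currently applied
    int hold = 0;     // frames left before the scroll-back starts

    void onButtonTap()
    {
        offset = kPeekPixels;
        hold = kHoldFrames;
    }

    // Call once per frame; returns the extra offset to add to the
    // game's normally configured screen offset.
    int tick()
    {
        if (hold > 0)
            --hold;        // still peeking
        else if (offset > 0)
            --offset;      // glide back one pixel per frame
        return offset;
    }
};
```

The sign of the returned offset would be flipped for a "scroll down" button.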

 

Here you can see my normal Ladybug Arcade playfield (left image) and then after I tap the scroll up it briefly shows me the top status stuff (right image).  I should have this out in tomorrow's daily build.

[Image: Ladybug Arcade playfield in the normal position (left) and scrolled up to show the top status area (right)]


6 hours ago, llabnip said:

Tonight I experimented with a new idea that allows the user to assign any of the DS buttons as a scroll/offset button. This allows you to tap a button and the screen will scroll up/down 16 pixels and, after a half-second, smoothly scroll back to your normal position.  So far this has been great! It allows me to see the entire playfield for all modern games and when there is a brief pause in the action, I can tap a button to glance up at the score. It's taken almost no time to mentally get used to this and it should be a great option to render these amazing new games as best possible on the smaller handheld.

Could you react to a change in the currently invisible area and scroll there automatically (as per game or user option)? With a maximum frequency, e.g. every 1 second if both areas change frequently.

Edited by Thomas Jentzsch

10 hours ago, llabnip said:

Ah... ok. You're on another level from me... I mostly just try things and see what works :) 

To stay on topic: :) 

 

I have tested on-the-fly decoding in Stella, and it works perfectly. You can manipulate the ROM without affecting 6502 code.

 

While doing that, I found that there are quite frequent opcode combinations (e.g. cmp1 + b1, add2 + cmp1 + b1). I suppose one could profit from this somehow, e.g. by pre-decoding and using the flags resulting from cmp more directly. Might become quite complex though.
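
A small sketch of how such opcode combinations could be measured: count adjacent pairs in a trace of decoded opcodes. The `Op` enum here is a tiny hypothetical subset and `countPairs` is an illustrative helper, not something from Stella.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Tiny hypothetical subset of decoded opcodes.
enum class Op : uint8_t { add2, cmp1, b1, other };

using PairCounts = std::map<std::pair<Op, Op>, unsigned>;

// Counts how often each (previous, current) opcode pair occurs in a trace.
PairCounts countPairs(const std::vector<Op>& trace)
{
    PairCounts counts;
    for (std::size_t i = 1; i < trace.size(); ++i)
        ++counts[{trace[i - 1], trace[i]}];
    return counts;
}
```

Pairs that dominate the counts (such as cmp1 followed by b1) would be the candidates for a fused, pre-decoded opcode.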


4 hours ago, Thomas Jentzsch said:

Could you react to a change in the currently invisible area and scroll there automatically (as per game or user option)? With a maximum frequency, e.g. every 1 second if both areas change frequently.

Yes... but as it's normally the score at the top of the screen, it would get pretty jarring pretty quickly.  

 

The configured DS button to pan up briefly is working quite well and I'm finding it pretty natural. Most games have a few breaks between stages or when you die that you can hit the button to peek at your score.  I've yet to see even one of the Champ Games titles utilize more than 192 scanlines for the actual gameplay playfield :)


11 hours ago, Thomas Jentzsch said:

I have tested on-the-fly decoding in Stella, and it works perfectly. You can manipulate the ROM without affecting 6502 code.

Tried it at lunch... works but the check for decode destroys my main Thumb loop as it executes on every iteration.

 

So I just used a separate buffer (duplicating the ARM code) - for now that limits me to 128K of ARM Thumb code (still twice the size of Turbo Arcade) and took about 40 of the most heavily hit opcodes and reworked the instruction so we don't have to shift or mask and it really only amounted to a half-frame of performance. Kinda disappointing.  In the end, I felt the juice wasn't worth the squeeze and decided to not try and modify the original ARM instructions. The split-out instructions in the DecodeRom[] handler really did make a great difference and that's great that you pointed me to it!

 

However, I did find a DS-specific improvement: keeping my cache lines full got me another full frame.  For the first time I've seen 60 fps on Draconian (it flickers 58/59/60 so I had to take several pics to get this!)

[Image: Draconian running at 60 fps]

Edited by llabnip

9 minutes ago, llabnip said:

Tried it at lunch... works but the check for decode destroys my main Thumb loop as it executes on every iteration.

Not sure what that means.

 

To me it looks like this:

uInt16 inst = *thumb_ptr++;
Thumbulator::Op decoded = (Thumbulator::Op)*thumb_decode_ptr++;

would have to be reworked into about this (untested!):

Thumbulator::Op decoded = (Thumbulator::Op)*thumb_decode_ptr;
if(decoded == Op::invalid)
{
  // first execution of this word: decode once and cache the result
  decoded = decodeInstructionWord(*thumb_ptr);
  *thumb_decode_ptr = decoded;
}
thumb_decode_ptr++;
uInt16 inst = *thumb_ptr++;

 

Yes, there is some minimal overhead (that extra if), but I think you could regain this by modifying the ROM (e.g. putting rb of B(1) and B(2) into it).

 

What am I missing?


1 minute ago, Thomas Jentzsch said:

Yes, there is some minimal overhead (that extra if), but I think you could regain this by modifying the ROM (e.g. putting rb of B(1) and B(2) into it).

That conditional if() is executed more than a million times per second. It dropped my Thumbulator performance a full 5+%.  LBA, for example, went from 57 to 53 fps. 

 

Even without the if(), my experiments showed my up-side is no more than 0.5 frames of improved performance on the more complex rendering games like Draconian or LBA - but the extra conditional makes that a net loss of more than 3 frames of performance across the board.

 

 


2 minutes ago, llabnip said:

That conditional if() is executed more than a million times per second. It dropped my Thumbulator performance a full 5+%.  LBA, for example, went from 57 to 53 fps. 

Wow, that's hefty. My own experiments on a PC platform show very different results. As soon as most currently used code is decoded (usually after one frame) the overhead is hardly measurable. I suppose it's a platform and/or compiler thing.

2 minutes ago, llabnip said:

Even without the if(), my experiments showed my up-side is no more than 0.5 frames of improved performance on the more complex rendering games like Draconian or LBA - but the extra conditional makes that a net loss of more than 3 frames of performance across the board.

True, if you lose 5% initially, you cannot make up that loss. Too bad. :sad: 


1 minute ago, Thomas Jentzsch said:

Wow, that's hefty. My own experiments on a PC platform show very different results. As soon as most currently used code is decoded (usually after one frame) the overhead is hardly measurable. I suppose it's a platform and/or compiler thing.

True, if you lose 5% initially, you cannot make up that loss. Too bad. :sad: 

Yeah, I went so far as to pre-decode and left that logic in, which means that it would NEVER test true for Op::invalid, and it was still the same loss. As you said, it works out all the opcodes it needs in a very short period of time - and after that it's just the overhead of the conditional.

 

The DS and its ARM9 have some further complications in that everything is related to 32-byte cache lines. If you access memory that isn't in the cache, it will cache 32 bytes around that memory. A cache miss is 10x more costly than a cache hit... and adding even one line of code can shift things enough that it's possible the loss is magnified on my target platform.
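
To make the "one extra line of code can shift things" point concrete, here is a small arithmetic sketch. `linesTouched` is a hypothetical helper that counts how many 32-byte cache lines a block of a given size spans, depending on where it starts.

```cpp
#include <cstdint>

constexpr uint32_t kCacheLine = 32;   // cache line size in bytes, as stated above

// Number of cache lines covered by `size` bytes starting at `start`.
// Assumes size > 0.
constexpr uint32_t linesTouched(uint32_t start, uint32_t size)
{
    const uint32_t first = start / kCacheLine;
    const uint32_t last = (start + size - 1) / kCacheLine;
    return last - first + 1;
}
```

For example, a 64-byte block starting at 0x100 spans two lines, but the same block pushed down by four bytes to 0x104 spans three, so a tiny code change elsewhere can add a potential miss to a hot loop.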

 

I'm using that to my advantage, however. One trick I'm pulling is that with some games, RAM is scattershot... meaning the game reads from here, there and everywhere. That stuff works great if I emulate the ram from the DS VRAM which is not as fast as cache but much faster than a cache miss... and scattershot produces a lot of cache misses.

 

But DPC+ Fast Fetchers and CDF/J/+ tend to move things around in blocks... so that same trick doesn't have the same utility as there are more sequential accesses.

 

It's all a bit frustrating really... but fun at the same time!

 


38 minutes ago, llabnip said:

Yeah, I went so far as to pre-decode and left that logic in, which means that it would NEVER test true for Op::invalid, and it was still the same loss. As you said, it works out all the opcodes it needs in a very short period of time - and after that it's just the overhead of the conditional.

I think what you describe now might explain it:

38 minutes ago, llabnip said:

The DS and its ARM9 have some further complications in that everything is related to 32-byte cache lines. If you access memory that isn't in the cache, it will cache 32 bytes around that memory. A cache miss is 10x more costly than a cache hit... and adding even one line of code can shift things enough that it's possible the loss is magnified on my target platform.

Sure, if the loop size exceeds the cache, then you will have repeated cache misses. But actually the code in the loop should become smaller, since you do not have to calculate rb and you also do not need separate opcodes for _pos and _neg branching. Maybe it is an alignment thing? But I suppose you are already aligning your data to 32-bit boundaries.

38 minutes ago, llabnip said:

I'm using that to my advantage, however. One trick I'm pulling is that with some games, RAM is scattershot... meaning the game reads from here, there and everywhere. That stuff works great if I emulate the ram from the DS VRAM which is not as fast as cache but much faster than a cache miss... and scattershot produces a lot of cache misses.

Pretty smart.

38 minutes ago, llabnip said:

It's all a bit frustrating really... but fun at the same time!

Sometimes you win, sometimes you lose. As long as you win more often, everything is fine. And pretty satisfying.


As I try to come to grips with the last of the CDFJ/+ games that need a bit more speed (Gorf 55-60, LBA 57-60, Draconian 53-60 and Turbo Arcade 45-55) I stumbled upon something interesting.

 

Gorf has a value set for the CDFJ+ Fast Fetcher offset. It uses 992d (3E0h). However, that memory location always contains zero.  But since there is an offset, the emulation has to fetch another 8-bit value and then do a comparison against a lower and upper bound (where the upper bound is the original fetcher offset of 33h plus that offset).  It's not crushing my emulation but it does cause a 2 frame per second penalty. 
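
A hedged sketch of the extra work the offset introduces. The function names are hypothetical, and the index of the last fast fetcher is passed in as a parameter on purpose so the sketch does not commit to a particular driver value.

```cpp
#include <cstdint>

// Without an offset: a single upper-bound comparison decides whether
// the operand refers to a fast fetcher.
inline bool isFastFetchNoOffset(uint8_t operand, uint8_t lastFetcher)
{
    return operand <= lastFetcher;
}

// With an offset: the offset byte has to be fetched first (passed in here
// as ffOffset), and the operand is then tested against both bounds.
inline bool isFastFetchWithOffset(uint8_t operand, uint8_t ffOffset,
                                  uint8_t lastFetcher)
{
    return operand >= ffOffset && operand <= ffOffset + lastFetcher;
}
```

When the offset byte is zero the two tests agree for every operand, which is why an option to ignore the offset is safe for a game that never sets it.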

 

Turbo Arcade has no such offset... that one is just doing a ton of memory moves.  I don't think having a 64K/8K RAM version will help much... the RAM isn't the bottleneck on this one. Still digging in...

 

@johnnywc

 

 

As an aside: I sure do wish the fast fetchers area was a power of 2.  Testing for 31 or less with a mask is faster than testing for 33 or less :)
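
For illustration, the two kinds of test look like this. Both function names are hypothetical, and whether the mask form is actually faster depends on the compiler and the surrounding code, so treat this only as a picture of the difference.

```cpp
#include <cstdint>

// Range 0..31 is a power-of-two range: "no bits above bit 4 are set".
inline bool atMost31ByMask(uint8_t x)
{
    return (x & 0xE0u) == 0;
}

// Range 0..33 is not a power of two, so it needs a real comparison.
inline bool atMost33ByCompare(uint8_t x)
{
    return x <= 33;
}
```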

Edited by llabnip

2 hours ago, llabnip said:

 

 

Gorf has a value set for the CDFJ+ Fast Fetcher offset. It uses 992d (3E0h). However, that memory location always contains zero.  But since there is an offset, the emulation has to fetch another 8-bit value and then do a comparison against a lower and upper bound (where the upper bound is the original fetcher offset of 33h plus that offset).  It's not crushing my emulation but it does cause a 2 frame per second penalty. 

 

Turbo Arcade has no such offset... that one is just doing a ton of memory moves.  I don't think having a 64K/8K RAM version will help much... the RAM isn't the bottleneck on this one. Still digging in...

 

@johnnywc

 

I think this is because Gorf is using the latest CDFJ+ driver that has the LDX/LDY FF and DD offset enhancements; the Turbo Arcade demo uses the original CDFJ+ driver that didn't have these implemented so there won't be a DD offset.  With that said, I don't need the CDFJ++ enhancements in Gorf, so it may be as simple as swapping in the old CDFJ+ driver and recompiling (or just copy the first 2K from Turbo to the first 2K of Gorf) and you should get a 2 frame increase in speed. 🐰  🐢 

 

I decided to port Gorf to CDFJ+ during the summer in case I needed to go > 32K (plus I like the memory layout better) but I didn't have to, and I don't use LDX, LDY or the DD offset so it should work with the older driver.  🤞

 

re: Turbo, yes it's doing a ton of mem copies each frame as well as some on the fly decompressing so it's pushing the limits.  If I went with > 8K it would make sense for me to decompress the data once which would speed things up considerably (similar to the update I did for LB:A).  I haven't worked on Turbo in over a year so once I dig back into it I'll start looking at how I can optimize it a bit. :) 


Ok, thanks for the explanation. I can easily add an override in configuration to not apply any offsets. The gain is marginal... but even two frames of performance at these slim margins is a win :) 

Actually - I wonder if there's a way to scan the ROM to see if that offset in RAM is ever touched. Hmmm.... Anyway, adding an option is easy enough. Even if you did change back there are other developers who will just use the latest driver even if they aren't using all the features. I already have an option for the ARM to enable unsafe optimizations, no TIA collisions and now this will be another option to ignore fast fetcher offsets to yield the greatest output rendering speed.

 

Edit: okay... twas easy to add as an option and it's in. Easiest 2 frames of performance gain ever on this ARM stuff :) 

Edited by llabnip

27 minutes ago, llabnip said:

Ok, thanks for the explanation. I can easily add an override in configuration to not apply any offsets. The gain is marginal... but even two frames of performance at these slim margins is a win :) 
 

:idea: Actually, now I recall why I used the latest CDFJ+ driver.  It's actually been modified to work with the 48 pin Melody boards so it can run on the Harmony Encore.  There are 2 drivers, one is cdfjplus48.bin that you use if your game is 32K and 8K of RAM or cdfjplus64.bin if you're using 64K or greater and 16K of RAM.  Games using this driver cannot run on the Harmony or Encore.

 

Each of these drivers is then modified by hand (although I think @SpiceWare's templates do it automatically for you) to update particular bytes in the driver to enable LDX/LDY FF and set the DD offset.  

27 minutes ago, llabnip said:

Actually - I wonder if there's a way to scan the ROM to see if that offset in RAM is ever touched. Hmmm....

 

:idea: Well, if they're not using this offset the value should be 0 in the driver.  The developer will manually modify the driver if they need this offset to be something other than 0.  Not sure if that helps.

27 minutes ago, llabnip said:

 

Anyway, adding an option is easy enough. Even if you did change back there are other developers who will just use the latest driver even if they aren't using all the features. I already have an option for the ARM to enable unsafe optimizations, no TIA collisions and now this will be another option to ignore fast fetcher offsets to yield the greatest output rendering speed.

Good points, and as it turns out I can't use the old driver anyway since I want Gorf to run on the Harmony carts. :)  Good idea to add in the option to ignore the FF offset! :thumbsup:  


25 minutes ago, johnnywc said:

Each of these drivers is then modified by hand (although I think @SpiceWare's templates do it automatically for you)

 

The 6507 code does this to include the driver:

 

;===============================================================================
; Define Start of Cartridge
;----------------------------------------
;   CDFJ+ cartridges must start with the Harmony/Melody driver.  The driver is
;   the ARM code that emulates the CDFJ+ coprocessor.
;
;   The configure_cdfjplus.h file will validate the Project Configuration, 
;   and include the appropriate version of the CDFJ+ driver for this project.
;===============================================================================

        SEG CODE    
        ORG $0000
    
HM_DRIVER:
    include configure_cdfjplus.h

 

 

After validating the configuration and setting additional values, the last thing configure_cdfjplus.h does is:

 

;===============================================================================
; CDFJ+ Driver
;---------------------------------------- 
;   Include the appropriate driver based on ROM SIZE and the
;   Fast Fetcher configuration
;===============================================================================

   IF (_PC_ROM_SIZE = 32 && _PC_CDFJ_FF = 0)
         INCBIN cdfjplus_driver/cdfjplus48A_20220131.bin
  ENDIF        
  IF (_PC_ROM_SIZE = 32 && _PC_CDFJ_FF = 1)
         INCBIN cdfjplus_driver/cdfjplus48AX_20220131.bin
  ENDIF        
  IF (_PC_ROM_SIZE = 32 && _PC_CDFJ_FF = 2)
         INCBIN cdfjplus_driver/cdfjplus48AY_20220131.bin
  ENDIF        
  IF (_PC_ROM_SIZE = 32 && _PC_CDFJ_FF = 3)
         INCBIN cdfjplus_driver/cdfjplus48AXY_20220131.bin
  ENDIF        
 
  IF (_PC_ROM_SIZE >= 64 && _PC_CDFJ_FF = 0)
         INCBIN cdfjplus_driver/cdfjplus64A_20220131.bin
  ENDIF        
  IF (_PC_ROM_SIZE >= 64 && _PC_CDFJ_FF = 1)
         INCBIN cdfjplus_driver/cdfjplus64AX_20220131.bin
  ENDIF        
  IF (_PC_ROM_SIZE >= 64 && _PC_CDFJ_FF = 2)
         INCBIN cdfjplus_driver/cdfjplus64AY_20220131.bin
  ENDIF        
  IF (_PC_ROM_SIZE >= 64 && _PC_CDFJ_FF = 3)
         INCBIN cdfjplus_driver/cdfjplus64AXY_20220131.bin
  ENDIF        

 



Thanks Darrell, I figured you were doing something like that.  :thumbsup:  Looks like these handle which registers to enable for FF; how do you handle updating the driver with the DD offset?


1 hour ago, johnnywc said:

Thanks Darrell, I figured you were doing something like that.  :thumbsup:  Looks like these handle which registers to enable for FF; how do you handle updating the driver with the DD offset?

 

 

defines_cdfjplus.h has this:

 

unsigned char* _FF_OFFSET=(unsigned char*)0x400003E0;

 

which points to the offset value in the CDFJ+ driver.

 

The C routines include the header, and if the _PC_FF_OFFSET is defined and non-zero it'll modify the driver at run-time.

 

void Initialize()
{
    int i;
    
    // When powered up the 4K of Display Data RAM will contain random values,
    // so zero it out
    myMemset(RAM, 0, 4096);
    
    // likewise the datastream increments will be random, so set them to 1.0
    for(i=0;i<=34;i++)
        setIncrement(i,1,0);
    
#ifdef _PC_FF_OFFSET
#if _PC_FF_OFFSET != 0
    
    *_FF_OFFSET = _PC_FF_OFFSET;
    
#endif
#endif
}

 



Makes sense, thanks Darrell!  Of course you could just modify the .bin directly and save 4 bytes (or how much ROM that assignment takes up), but this is much easier to do. :thumbsup:  I assume you could have done the same to update the driver to enable/disable LDX LDY FF as well (at the cost of some more ROM for the assignment).  

 

While I have your attention ;), do you know why Stella 6.6 can't run a CDFJ+ game like Gorf, even if it doesn't use LDX, LDY FF and has a DD offset of 0?  I understand that functionality was added in 6.7 but I was hoping Gorf would at least run on 6.6 since it does support CDFJ+ (or at least I thought it did).  I'm asking because I wanted to play Gorf on my R77 and the Stella image for that is at 6.6 and Gorf crashes on it (so does Qyx).  Perhaps it doesn't run because of the changes that were made in the cdfjplus driver for the 48-pin compatibility. :ponder: 

 

 

 


3 minutes ago, SpiceWare said:

 

 

defines_cdfjplus.h has this:

 

unsigned char* _FF_OFFSET=(unsigned char*)0x400003E0;

 

which points to the offset value in the CDFJ+ driver.

So I see exactly that for Gorf... but that's not the FF offset, it's the memory location where the FF offset is... right?

 

And so I see 3E0 as the FF Offset pointer but the memory location of ARM_RAM[3E0] is always zero (since John is not using a FF offset).

 

Who would write the 3E0 memory location? Can I assume that if there is a zero in memory location 3E0 it will never change? I assumed the FF Offset was changeable at run-time.
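
To make the distinction in the question concrete, here is a minimal sketch. The 0x400003E0 address comes from the quoted defines_cdfjplus.h; the `ArmRam` struct, the 8K size and the function name are hypothetical, and RAM is assumed to be mapped at 0x40000000.

```cpp
#include <array>
#include <cstdint>

constexpr uint32_t kRamBase = 0x40000000u;        // assumed RAM mapping
constexpr uint32_t kFFOffsetAddr = 0x400003E0u;   // where the offset byte lives

// Hypothetical stand-in for the emulated ARM RAM.
struct ArmRam {
    std::array<uint8_t, 8192> bytes{};

    uint8_t& at(uint32_t armAddr) { return bytes[armAddr - kRamBase]; }
};

// 0x3E0 is only the location; the offset itself is whatever byte is
// stored there at the moment it is read.
inline uint8_t currentFFOffset(ArmRam& ram)
{
    return ram.at(kFFOffsetAddr);
}
```

A game that runs the quoted `*_FF_OFFSET = _PC_FF_OFFSET;` assignment at start-up changes the byte, not the location.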

 

In other news, I found that a Fast Fetch LDA is often followed by STA. It's 15:1 in Gorf that it's an LDA-STA pair.  It's only 5:1 for Turbo.   But doing a "look ahead", spotting the LDA-STA pairs and processing them together actually produces a nice speedup! Gorf got almost 2 frames of performance. Draconian as well.  Galagon gets a frame of performance as does LBA.  Turbo Arcade doesn't even get a full frame of performance... but it's not a loss!
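
A rough sketch of the look-ahead idea, under heavy assumptions: the `Cpu` struct and the `fastFetch` stand-in are hypothetical, the store is modelled as a plain zero-page write, and flag updates and cycle counting are left out entirely.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint8_t kLdaImm = 0xA9;   // 6502 LDA #immediate
constexpr uint8_t kStaZp = 0x85;    // 6502 STA zero page

struct Cpu {
    uint8_t a = 0;
    std::size_t pc = 0;
    uint8_t zp[256] = {};
};

// Hypothetical stand-in for reading the next byte of a datastream.
inline uint8_t fastFetch(uint8_t stream)
{
    return static_cast<uint8_t>(stream * 2 + 1);
}

// Executes a fast-fetch LDA at cpu.pc (the caller has already checked
// that it is one). If an STA zp follows directly, it is handled in the
// same step and true is returned; otherwise only the LDA is executed.
inline bool stepLdaFastFetch(Cpu& cpu, const std::vector<uint8_t>& code)
{
    cpu.a = fastFetch(code[cpu.pc + 1]);
    cpu.pc += 2;

    if (cpu.pc + 1 < code.size() && code[cpu.pc] == kStaZp) {
        cpu.zp[code[cpu.pc + 1]] = cpu.a;   // fused store
        cpu.pc += 2;
        return true;
    }
    return false;
}
```

The fused path saves one trip through the main opcode dispatch for every pair it catches.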
