JetSetIlly

Members
  • Posts

    763
Posts posted by JetSetIlly

  1. 3 hours ago, SpiceWare said:

     

    If you defined an array like this:

     

    
    unsigned char arena_color[4] =
    {
        _RED + 4,
        _GREEN + 4,
        _BLUE + 4,
        _WHITE
    };

    without the const then the array is defined as being RAM, not ROM. The compiler sets up the initial values of _RED + 4, ..., _WHITE in a data section in ROM.  The very first time you call custom ARM code a one-time process will run that copies those initial values from the ROM data section to the appropriate location in RAM.

     

    The section(".data") bit of __attribute__ taps into those routines to copy the function from ROM to RAM for you.

     

    Yes. I was thinking from an emulation point of view about why this works without any change to the driver. It's because these functions are copied from ROM to RAM as part of the .data section, which is already emulated (in Stella, etc.). A very nice solution for time-critical code.
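    To make the mechanism concrete, here is a minimal sketch assuming a GCC-style toolchain. The colour values, the RAM_FUNC macro and sum_colors() are made-up names for illustration only; they are not taken from any driver or project. The section attribute is only applied when building for ARM, so the sketch also compiles on a desktop compiler.

```c
#include <stddef.h>

/* Hypothetical colour values, for illustration only. */
#define COL_RED   0x40
#define COL_GREEN 0xC0
#define COL_BLUE  0x80
#define COL_WHITE 0x0E

/* const: the table can stay in ROM (flash) and is read from there. */
const unsigned char arena_color_rom[4] = {
    COL_RED + 4, COL_GREEN + 4, COL_BLUE + 4, COL_WHITE
};

/* No const: the array lives in RAM. Its initial values are kept in ROM
   and copied across by the startup code with the rest of .data. */
unsigned char arena_color_ram[4] = {
    COL_RED + 4, COL_GREEN + 4, COL_BLUE + 4, COL_WHITE
};

/* Apply the section attribute only on ARM builds (assumption: GCC). */
#if defined(__GNUC__) && defined(__arm__)
#define RAM_FUNC __attribute__((section(".data")))
#else
#define RAM_FUNC
#endif

/* On ARM this function is placed in .data, so it is copied to RAM
   by the same one-time routine that initialises the variables. */
RAM_FUNC unsigned int sum_colors(const unsigned char *colors, size_t n)
{
    unsigned int total = 0;
    for (size_t i = 0; i < n; i++) {
        total += colors[i];
    }
    return total;
}
```

    Because the copy is just part of the normal .data initialisation, an emulator that already handles .data needs no special case for the function.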

     

  2. This is incredibly useful. PrepAreaBuffers() appears to be at 0x000011b0 while RAM_PrepAreaBuffers() appears to be at 0x4000185c. There's no copying of the custom program to somewhere in RAM, it's just the variable block. Interesting.

     

     

  3. The changes I've made today around cycle accuracy and how the MAM works are worth packaging up, I think.

     

    Summary:

    - MAM now differentiates between mode 1 and 2
    - MAM set according to cartridge mapper (ie. DPC+ or CDF*) [thanks @SpiceWare]
    - Counting of conditional branches corrected  [thanks @Thomas Jentzsch]

     

    https://github.com/JetSetIlly/Gopher2600/releases/tag/v0.12.1

     

    And thanks to @Al_Nafuur for help with the Windows binary.

    • Like 3
  4. 1 hour ago, Thomas Jentzsch said:

    But if you are reading from a fixed table, you can also be reading from flash, no?

     

    Good point. Currently the assumption is that access to SRAM is only made in the case of PUSH, POP, LDMIA and STMIA, which I think is reasonable. All other loads and stores are stretched for Flash memory. The current MAM mode alters that but in principle most reads/writes are stretched for Flash.

     

    But as you said in a previous post, the available documentation for cycle usage isn't very good, so I'm making a lot of assumptions.

     

    At the moment, I'm happy if the timings work out so that (a) existing ROMs work (or don't work) as expected and (b) that it's helpful for developing new ROMs. Which for me, means causing a screen roll or running past the end of memory if the Thumb program runs too long.

     

     

  5. 22 minutes ago, SpiceWare said:

     

    With DPC+, using the custom ARM code is not required.  I did a quick test of DK Arcade and it's running 6507 code in bank 0, which suggests it's not using ARM code (or using so little that it's been fully tested and doesn't trigger the bug).

     

    It's definitely running ARM code. So we can assume that it just so happens not to trigger the bug.

     

    16 minutes ago, Thomas Jentzsch said:

    You are the expert, not I. :) 

     

    That's easy.

     

    @JetSetIlly I suppose we won't need any assumption about what is RAM and what is flash anymore.

     

    I'm assuming that the PC will never jump from one area to the other and that data read/writes are always from SRAM. I'm already measuring if the read/write is from RAM or Flash.

     

    21 minutes ago, Thomas Jentzsch said:

    The errata sheet says:

    The bug seems only to occur for latched, non-sequential data in flash. Maybe the exact condition is very rare, so that not all games are affected.

     

    Right. That's probably enough information to have a linting check in the emulator. (Probably not needed now that CDF* is available but it's a nice feature).

     

    Thanks @SpiceWare and @Thomas Jentzsch for the info and helping me think things through.

    • Like 1
  6. 9 minutes ago, SpiceWare said:

     

    All the drivers use MAM mode 2 for the driver itself for performance reasons. The driver code is running in RAM, which does not trigger the bug.

     

    Yes.

     

    9 minutes ago, SpiceWare said:

    In DPC+ projects the game's code sets MAM 1 to prevent the crashing.  From Stay Frosty 2:

     

    Right. This might be where my confusion has arisen. Many DPC+ARM ROMs I have seen do as you describe. However, there are some which do not.

     

    For example, the ROM I have of DK Arcade (link below) does not do this.

     

    So are we saying that this ROM may crash on the hardware if it triggers the bug?

     

     

    Either way, I understand now.

     

    9 minutes ago, SpiceWare said:

    In BUS, CDF, and CDFJ the driver's code sets MAM to 1 before it calls main(), so main() no longer changes it.

     

    9 minutes ago, SpiceWare said:

    As of CDFJ+ the driver no longer sets MAM because the bug is not a factor in the newer boards.  I think, but am not sure, that @johnnywc has a one-off build of the CDFJ which does not change MAM to 1.

     

    Right. So it remains in MAM mode 2. That tallies with what I've read in the Turbo demo thread.

     

    Let me summarise my understanding:

     

    1. The DPC+ driver does not change the MAM mode from mode 2

        1a) The Thumb program must therefore change to MAM mode 1 or risk a crash

        1b) The Thumb program must change back to MAM mode 2 before exit

    2. The CDF and CDFJ drivers handle the MAM changes (so the Thumb program doesn't have to)

    3. The CDFJ+ driver leaves MAM in mode 2
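    Written out as a lookup, my summary looks something like the sketch below. It only encodes the understanding listed above; the enum and function names are invented for illustration and the -1 fallback for an unknown mapper is arbitrary.

```c
/* Mapper names as used in this thread. */
typedef enum {
    MAPPER_DPC_PLUS,
    MAPPER_CDF,
    MAPPER_CDFJ,
    MAPPER_CDFJ_PLUS
} mapper_t;

/* MAM mode the Thumb program sees when main() is entered. */
int mam_mode_on_entry(mapper_t m)
{
    switch (m) {
    case MAPPER_DPC_PLUS:
        return 2;   /* driver leaves mode 2 in place */
    case MAPPER_CDF:
    case MAPPER_CDFJ:
        return 1;   /* driver switches to mode 1 before main() */
    case MAPPER_CDFJ_PLUS:
        return 2;   /* driver leaves mode 2; bug not a factor */
    }
    return -1;      /* unknown mapper (arbitrary fallback) */
}

/* Non-zero if the Thumb program itself should switch to mode 1
   on entry and back to mode 2 before exit. */
int thumb_must_manage_mam(mapper_t m)
{
    return m == MAPPER_DPC_PLUS;
}
```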

     

     

    Thanks.

     

  7. 9 minutes ago, Thomas Jentzsch said:

    According to the errata sheet, the bug is (better: was) only in MAM 2. So I suppose MAM 1 is enabled. @batari should know best.

     

    If it's enabled by default on entering the Thumb program then even that is a significant difference to MAM 0.

     

    @batari if you could confirm that MAM 1 is set by the driver I would be very grateful.

     

  8. 23 hours ago, Thomas Jentzsch said:

    Sorry, I didn't mean you can completely ignore it. But it should be almost always cached by MAM, so no Flash penalties there. Of course it still takes 1 CPU cycle then.

     

    BTW: The documentation suggests setting MAMTIM to 3 CCLKS (not 4 as I wrote above) for CPU speeds > 40 MHz.

     

    Can you tell me something about the MAM as it exists in the Harmony? What is the nature of the bug exactly?

     

    Reading comments from @johnnywc I know the latest drivers put the chip in MAM mode 2 by default but what about earlier drivers? Do they initialise in MAM mode 0 or MAM mode 1?

     

    In old Harmony cartridges, is it only mode 2 that you can't enter from the Thumb program or can you not enter mode 1 either?

     

  9. 1 hour ago, Thomas Jentzsch said:

    Do you already have full MAM emulation implemented?

     

    Sort of. I've made the assumption that the caching is perfect and runs at the same speed as SRAM. Which, as you say, is negligible. No more than 10ns I suspect.

     

    Currently, you can have MAM turned on by default (meaning the driver has turned it on for you), or allow the Thumb program to turn it on (which, as I understand it, some versions of the Harmony do not allow). By default, I have the MAM active at the start of program execution, which is good for the very recent Champ games.

     

    [Image]

     

    1 hour ago, Thomas Jentzsch said:

     

    I doubt that SRAM is slow at all. And it seems that with MAM = 2 Flash memory access speed becomes mostly irrelevant, S (always?) and N (mostly) are already in the latches. Only (far) branches may cause delays due to memory access. However with MAM=1 (N only) and especially with MAM = 0 (S and N), Flash memory access can play a major role.

     

    Are you sure that you can discount S cycles if MAM is active? I've not found good information about the MAM but I'm assuming that an S cycle would take 1 unstretched ARM cycle. Is it documented anywhere that you can ignore S cycles?
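    For reference, the quoted description of the three MAM modes, read literally, comes out as the sketch below. This is an assumption-laden simplification, not documented behaviour: mode 2 is treated as perfectly latched, which is exactly the point in question. The type and function names are invented for illustration.

```c
typedef enum { CYCLE_I, CYCLE_S, CYCLE_N } cycle_type_t;

/* Returns 1 if a cycle of the given type, addressing Flash, is assumed
   to pay the Flash access penalty under the given MAM mode. */
int pays_flash_penalty(int mam_mode, cycle_type_t c)
{
    if (c == CYCLE_I) {
        return 0;               /* internal cycles never touch memory */
    }
    switch (mam_mode) {
    case 0:
        return 1;               /* MAM off: both S and N go to Flash */
    case 1:
        return c == CYCLE_N;    /* partial: only N cycles go to Flash */
    default:
        return 0;               /* mode 2: assumed always latched */
    }
}
```

    Even when the penalty is zero the cycle still costs one unstretched ARM cycle; the function only says whether stretching applies.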

     

  10. 15 minutes ago, Thomas Jentzsch said:

    You are welcome.

     

    And another question: I notice that in case of a shift operation you only add the I-cycle if the shift is > 0. From which doc did you get this?

     

    Hmm. Good question. If we look at the ARM equivalent instruction which is the MOV instruction, then the format of that instruction suggests that a shift happens when the shift bits are non-zero. If the bits are zero then a shift does not happen.

     

    The cycle information for the MOV instruction meanwhile says that the I cycle isn't required unless there is a shift. A bit pattern of zero means no shift, so no I cycle is required.

     

    Therefore, if we take at face value the equivalence of Thumb mode LSL/LSR and ARM mode MOV, then that says to me that LSL/LSR instruction with shift bits of zero do not require the additional I cycle.

     

    I may be overthinking it and I'm prepared to be wrong, but that was my interpretation.
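    In code, that interpretation is roughly the sketch below. The struct and function names are invented for illustration, and the base cost of one S cycle for the opcode fetch is my reading of the data-processing timings rather than something stated for this exact Thumb format.

```c
/* Counts of sequential, non-sequential and internal cycles. */
typedef struct {
    int s;
    int n;
    int i;
} shift_cycles_t;

/* Cycle profile for a Thumb LSL/LSR with an immediate shift amount,
   under the interpretation that a zero shift needs no I cycle. */
shift_cycles_t shift_imm_cycles(unsigned shift_amount)
{
    shift_cycles_t c = { 1, 0, 0 };   /* one S cycle (assumed base cost) */
    if (shift_amount != 0) {
        c.i = 1;                      /* extra I cycle only for a real shift */
    }
    return c;
}
```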

  11. Just now, Thomas Jentzsch said:

    I have the same doc. The question is whether this is a conditional instruction. If it is, then it is not always executed and would take only one S cycle if not executed.

    That makes sense and I think you're right. In ARM mode all instructions can be conditional but in Thumb mode it's only the branch which is conditional. I hadn't considered the possibility of conditionality and took that section at face value. Cheers.
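    A minimal sketch of the corrected counting, assuming the data sheet figure of two S cycles and one N cycle for a branch that is taken and one S cycle for an unexecuted instruction. The names are invented for illustration.

```c
/* Counts of sequential, non-sequential and internal cycles. */
typedef struct {
    int s;
    int n;
    int i;
} branch_cycles_t;

/* Cycle profile for a Thumb conditional branch. */
branch_cycles_t cond_branch_cycles(int taken)
{
    branch_cycles_t c = { 0, 0, 0 };
    if (taken) {
        c.s = 2;    /* pipeline refill */
        c.n = 1;    /* fetch from the new address */
    } else {
        c.s = 1;    /* condition failed: treated as unexecuted */
    }
    return c;
}
```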

    • Like 1
  12. 3 minutes ago, Thomas Jentzsch said:

    Agreed, valid assumptions. How are your stretchings defined? Integers?

    Floats.

     

    3 minutes ago, Thomas Jentzsch said:

     

    Also, couldn't you apply the stretching later on and only once? That would save some execution time, no?

     

     

    The doc says:

    "When the condition code of any instruction is not met, the instruction is not executed. An unexecuted instruction takes one cycle."

     

    In case of a conditional branch not taken IMO this implies that it takes only 1 cycle, no?

     

    We must have different documentation. This is from the ARM7TDMI-S technical reference manual https://developer.arm.com/documentation/ddi0234/b

    [Image: excerpt from the ARM7TDMI-S technical reference manual]

     

  13. 21 minutes ago, Andrew Davie said:

    Working well for me. Currently about 52 fps on my ARM-intensive project with all the CRT options disabled to get the extra bit of speed :)

    MacBook early 2013, 2.6 GHz dual-core i5

    Thanks for the updates!

    Disabling CRT effects shouldn't make any difference. All the CRT processing is done on the GFX chip so assuming you have a GFX chip the performance impact is negligible. I'm interested if you're experiencing anything different.

     

    Note that I have a GTX 650 in my development machine which is the same vintage as your MacBook Pro. What spec GFX chip does the Pro have?

  14. 10 minutes ago, Thomas Jentzsch said:

    Have you been able to verify your counts? I have added cycle counts to Stella lately too and wonder how precise they can become. 

     

    Issues:

    • I found no clear cut details in cycle count documentation what happens in case of a branch. I currently assume 1 cycle if not taken and 3 cycles if taken. Do you have different info or can you confirm this?
    • The timing will change when MAM is disabled. Then Flash memory access will require extra cycles. Old Harmony carts have a CPU which has a MAM bug and therefore it is disabled by default. Do you have any info on how this affects the cycle count? Edit: I just read that you are addressing this already. :thumbsup:

    I'll summarise my understanding as best as I can. Apologies if you already know this.

     

    There are four types of cycles: I cycles, S cycles, N cycles and C cycles. The fourth type, C cycles, can be ignored in our case.

     

    I cycles are unaffected by memory speed and run at the ARM clock rate. N and S cycles can be "stretched" according to the underlying speed of the memory being addressed.

     

    For Gopher2600, I've added the cycle profile for each of the 19 instruction groups. During execution of an instruction I make a count of the I, N and S cycles. Crucially, I count N and S cycles according to whether it was a PC fetch or a "data fetch" (this information is in the ARM7TDMI data sheet).

     

    Once an instruction has completed I apply the appropriate stretching for the memory type. I make an assumption here that all data fetches are from SRAM and all PC fetches are from the memory area pointed to by the PC value at the end of the instruction. Both of these are reasonable assumptions I think.

     

    If MAM is enabled I assume that the caching is "perfect" and all accesses occur at SRAM speed. How the MAM works is probably the area where the most improvement can be found but for now it seems to work okay.
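    As a rough sketch of the scheme just described (this is not the Gopher2600 source, which is written in Go; the struct, function and parameter names are invented for illustration): the counts are collected per instruction, then stretched once at the end.

```c
/* Cycles counted while one instruction executes. */
typedef struct {
    int i;        /* internal cycles, never stretched */
    int pc_n;     /* N cycles spent on instruction fetch */
    int pc_s;     /* S cycles spent on instruction fetch */
    int data_n;   /* N cycles spent on data access */
    int data_s;   /* S cycles spent on data access */
} cycle_count_t;

/* Apply stretching after the instruction has completed.
   Assumptions, as in the post: data accesses are always SRAM,
   PC fetches use the memory the PC ends up in, and an enabled MAM
   makes Flash behave like SRAM. */
double stretched_cycles(cycle_count_t c, int pc_in_flash, int mam_enabled,
                        double flash_stretch, double sram_stretch)
{
    double pc_stretch = sram_stretch;
    if (pc_in_flash && !mam_enabled) {
        pc_stretch = flash_stretch;
    }
    return (double)c.i
         + (double)(c.pc_n + c.pc_s) * pc_stretch
         + (double)(c.data_n + c.data_s) * sram_stretch;
}
```

    For example, an instruction with one I cycle and three PC-bound cycles costs 13 ARM cycles when run from Flash with a stretch of 4 and the MAM off, and 4 cycles with the MAM on.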

     

    On the subject of conditional branching: the ARM7TDMI data sheet says it takes two S cycles and one N cycle (both PC bound) so that's three cycles at the speed of the underlying memory pointed to by the PC. If the branch alters the PC from one memory area to the other then the cycles might stretch differently but I haven't bothered modelling that. (a) it's unlikely and (b) the 6502 probably wouldn't even notice the difference (unless it happens a lot).

     

    There's nothing in the documentation that indicates cycle usage is different if the branch is successful or not.

     

     

  15. I've made some significant performance improvements to Gopher2600 this week. I can't promise it'll be fast enough for everyone but on my machine there is about a 9% improvement in FPS for a normal 2600 ROM and about 20% improvement for a typical example of a ROM using the ARM chip.

     

    The improvements are a combination of TIA streamlining (recognising that some conditions can be eliminated if some other condition is true/false) and removing a counter-productive memory reallocation when the ARM program is executed.

     

    https://github.com/JetSetIlly/Gopher2600/releases/tag/v0.12

     

    I've also made some improvements to the CRT emulation. Scaling the image is now limited to whole steps (so 3x or 4x, etc.). This prevents color and size banding noticeable on some ROMs if the scaling factor was not a whole number. I've also added controls to adjust the sharpness of the image and the fineness of the shadowmask and scanlines.

     

    A bilinear filter is now applied to the source CRT texture. I discovered this by accident when experimenting with scaling methods but I've found that it enhances the brick effect in zookeeper very nicely indeed. This probably isn't news to anyone but me.

     

    [Image]

     

    Finally, screen roll. There are no settings to adjust this yet but the screen will desynchronise "correctly" when a VSYNC comes too late. Recovery to a stable image takes a second or two. I hadn't thought about screen roll originally but was encountering the need for it more and more now that ARM cycle-counting is in place.

     

     

     

    • Like 2
  16. 28 minutes ago, Andrew Davie said:

     

    As to the new display rendering/mode it doesn't play well at all with my particular game, alas.

    I get distracting banding down the screen.

     

     

    I've messaged you but in case you're talking about the scanline effect being too heavy you can turn it down or turn it off through the CRT Preferences window, which you can open with F10. Uncheck or alter the effect strength to your taste. Pixel Perfect renders with no effects at all.

     

    [Image]

     

     

    • Like 1
  17. 4 minutes ago, Dionoid said:

    Wow, that "bended scanlines" TV mode is really nice.

    Cheers

    4 minutes ago, Dionoid said:

    I tried the latest release with a CDFJ rom, but noticed that the performance is not on par with playing on a real '2600. Could the new Go compiler help with this?

     

    We'll have to see when 1.17 is released but from what I've seen there will be a difference.

     

    But speed, generally, is a problem for this emulator when compared to Stella. It's partly down to the differences between C++ and Go but a lot of it is down to my emulation method, which is probably more fussy than it needs to be.

     

    These are the top performance hogs in the emulator, running the Gorf Arcade demo. As you can see the ARM emulation is a very small percentage of overall cost. It's the way I'm doing the TIA emulation which is causing the most expense.

     

    [Image: profile of the top performance hogs]

     

    I can get a more-or-less solid 60fps on my development machine (a 2012 i3) which was my goal when starting this.

     

    You can check for performance with the following:

     

    gopher2600 performance -display -fpscap=false romfile.bin

     

    and

     

    gopher2600 performance -fpscap=false romfile.bin

     

    Any difference between the -display and non-display versions tells us the basic overhead of the screen rendering, which is cut out entirely unless the -display flag is used. By my measurements there's quite a lot to gain but I'm no expert on graphics programming so I can't see how to improve it at the moment.

     

    If you're getting around 60fps normally, using the -fpscap=false option can give you a better idea of performance. Limiting the frame rate to the TV specification introduces its own set of problems so removing it from the measurement can be good.

     

    I think a good next step for me would be to run and profile the program on a different machine (with a different OS). I've only ever seen it run on this machine and I think a different rig might highlight differences I've not considered.

  18. I've packaged up recent changes as v0.11. https://github.com/JetSetIlly/Gopher2600/releases/tag/v0.11

     

    Main features in this release are better CRT shaders and the improved ARM timings. The Turbo Arcade demo will also work with this version now that I've added support for CDFJ+

     

    I was hoping the new Go compiler would be ready to use but it's not due for a few more weeks yet. When compiled with the development version of the compiler however, there is an approx 8% performance increase in Gopher2600. Not massive but still significant. This version has been compiled with 1.16.4

     

    • Like 4
  19. I've been reworking the ARM cycle counting to try to better account for all the variant hardware and expectations people have. In particular, the new Turbo Arcade demo had issues unless the emulator was in "immediate ARM execution" mode.

     

    I've gone through all the instructions and differentiated when N and S cycles are addressing "PC" addresses and "data" addresses. I'd already entered the cycle profile for each instruction group so this wasn't as much hard work as I first expected.

     

    For simplicity, I've assumed that all "data" read/writes are done in SRAM. This is not as big an assumption as it first appears because only PUSH, POP, LDMIA and STMIA ever use cycles in this way and I'm fairly confident those instructions rarely (if ever) read/write to flash.

     

    The other assumption is that the MAM caching is essentially perfect and when enabled Flash memory is never touched.

     

    N and S cycles are stretched according to the speed of the memory being addressed. In previous versions I used a flat value of 2 (which was a reasonable first estimation) for N cycles only and in all instances. This led to a reasonable average result but the new version should be more accurate in more situations.

     

    I've also added some more ARM options to the preferences window. This can be summoned in playmode as well as the debugger. The full list of ARM options is now:

     

    * Immediate ARM execution - thumb program returns immediately and consumes no 6507 time

    * Default MAM Enable for Thumb Programs - assume the Harmony driver is enabling the MAM

    * Allow MAM Enable from Thumb - allow the enabling of MAM from within the thumb program. From what I understand, some editions of the Harmony do not allow this. I've added this option in case there are versions or variants which do allow it.

     

    The Timings sliders:

    * ARM Clock - the basic speed of the ARM

    * Flash Access Time and SRAM Access Time - speed in nanoseconds. The slower the memory the more stretching for N and S cycles.
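    To illustrate how the sliders relate to stretching, here is one plausible formula. It is an assumption for illustration, not necessarily the exact calculation in Gopher2600: the stretch is the memory access time divided by the ARM clock period, and never less than one cycle. The function name is made up.

```c
/* Stretch factor for an N or S cycle.
   clock_mhz: ARM clock in MHz. access_ns: memory access time in ns. */
double stretch_factor(double clock_mhz, double access_ns)
{
    double period_ns = 1000.0 / clock_mhz;   /* length of one ARM cycle */
    double stretch = access_ns / period_ns;
    if (stretch < 1.0) {
        stretch = 1.0;   /* memory faster than the clock: no stretching */
    }
    return stretch;
}
```

    With a 100 MHz clock (a round number chosen for the arithmetic, not a real setting) a 50ns memory gives a stretch of 5, while a 5ns memory gives no stretching at all.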

     

    I'm not sure if my default speed values are correct. But these are the values that seem to hit the sweet spot for the collection of ARM ROMs I have available.

     

    I plan to do some more work on this this week. Checking for accuracy and adding some instrumentation to the debugger.

     

    Here's a short video showing the effect of changing memory speed on the Gorf Arcade title screen. Apologies for my poor screen-roll emulation - that's next on the TODO list.

     

    Source on Github.

     

    • Like 3
  20. 23 minutes ago, johnnywc said:

    Hi there!  The game does not roll on real hardware; I'm assuming the point in the game is when the hill appears or during a transition screen wipe to another screen?

     

    Yes. About halfway through the hill stage.

     

    Quote

    The ARM cycle counting thing sounds very interesting; how is it done and is it something developers can use?  Probably the most frustrating part of developing for the ARM is trying to speed optimize code without any true benchmark telling you whether or not the changes you've done are helping or not.

     

    Cycle counting is part of my emulator. It differs to Stella in that Stella executes the ARM program instantly, relative to the 6507. I've tried a different approach whereby the 6507 is stalled with NOPs like in the real hardware. Cycle counting is tricky with the ARM however so it's not perfect but it is helpful to make sure the program isn't going berserk. I hope to get the emulation to a state where it is accurate enough for optimisation work but it's not there yet.
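    The idea of stalling the 6507 can be sketched as a simple clock conversion. This is an illustration under stated assumptions, not the emulator's actual code: the function name is invented, both clock speeds are passed in as parameters, and partial cycles are rounded up.

```c
/* Number of whole 6507 cycles to stall for while the ARM program runs.
   arm_cycles: stretched ARM cycle total for the Thumb program.
   arm_clock_mhz, cpu_clock_mhz: clock speeds of the ARM and the 6507. */
unsigned long stall_cycles_6507(double arm_cycles, double arm_clock_mhz,
                                double cpu_clock_mhz)
{
    double exact = arm_cycles * cpu_clock_mhz / arm_clock_mhz;
    unsigned long whole = (unsigned long)exact;
    if (exact > (double)whole) {
        whole += 1;   /* a partly used 6507 cycle is still a stalled cycle */
    }
    return whole;
}
```

    If the total grows past what the 6507 kernel has budgeted for, the VSYNC arrives late and the screen rolls, which is the symptom a developer would look for.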

     

    This is the link to the README where I briefly discuss ARM emulation.

     

    https://github.com/JetSetIlly/Gopher2600#arm7tdmi-emulation

     

    (Current code supports CDFJ+ but I've not prepared a binary yet)

     

    Quote

    Yes, MAM is enabled in Turbo Arcade.  The screen will roll without it (mostly because we need to update the entire 40x184 view screen on every frame from compressed data).

    MAM is enabled, but we never write to the MAM control address because Chris aka cd-w provided me with a CDFJ+ driver that has MAM enabled by default so I can save a few bytes of ROM not having to enable it in code (same for RobotWar and Gorf Arcade). :idea: :D 

     

     

    Ah. Of course. I didn't think about it being enabled in the driver.

     

    Quote

     

    It is an upgraded CPU but I'm not sure what the specs are.  @batari Fred would have all the info about that. 

     

     

    • Like 1