
Gopher2600 (continuing development on Github)


JetSetIlly


I've made some significant performance improvements to Gopher2600 this week. I can't promise it'll be fast enough for everyone but on my machine there is about a 9% improvement in FPS for a normal 2600 ROM and about 20% improvement for a typical example of a ROM using the ARM chip.

 

The improvements are a combination of TIA streamlining (recognising that some conditions can be eliminated if some other condition is true/false) and removing a counter-productive memory reallocation when the ARM program is executed.

 

https://github.com/JetSetIlly/Gopher2600/releases/tag/v0.12

 

I've also made some improvements to the CRT emulation. Scaling the image is now limited to whole steps (so 3x or 4x, etc.). This prevents the color and size banding that was noticeable on some ROMs when the scaling factor was not a whole number. I've also added controls to adjust the sharpness of the image and the fineness of the shadowmask and scanlines.
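
As an illustration of whole-step scaling, here is a minimal sketch in C. It is not taken from Gopher2600; the function name and the idea of fitting a frame inside a window are assumptions made for the example. It picks the largest whole-number factor at which the frame still fits.

```c
/* Hypothetical helper: largest whole-number scale at which a
 * src_w x src_h frame fits inside a win_w x win_h window.
 * Assumes src_w and src_h are positive. Never returns less than 1. */
int whole_scale(int src_w, int src_h, int win_w, int win_h)
{
    int sx = win_w / src_w;      /* integer division discards the fraction */
    int sy = win_h / src_h;
    int s = sx < sy ? sx : sy;   /* the tighter axis decides */
    return s < 1 ? 1 : s;
}
```

With a 160x192 frame in an 800x600 window the horizontal axis would allow 5x but the vertical axis only 3x, so the result is 3.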

 

A bilinear filter is now applied to the source CRT texture. I discovered this by accident when experimenting with scaling methods but I've found that it enhances the brick effect in Zookeeper very nicely indeed. This probably isn't news to anyone but me.

 

image.png.ed0550027e7456ce6b082bb366a89062.png

 

Finally, screen roll. There are no settings to adjust this yet but the screen will desynchronise "correctly" when a VSYNC comes too late. Recovery to a stable image takes a second or two. I hadn't thought about screen roll originally but was encountering the need for it more and more now that ARM cycle-counting is in place.

 

 

 


59 minutes ago, JetSetIlly said:

...now that ARM cycle-counting is in place.

Have you been able to verify your counts? I have added cycle counts to Stella lately too and wonder how precise they can become. 

 

Issues:

  • I found no clear-cut details in the cycle count documentation about what happens in the case of a branch. I currently assume 1 cycle if not taken and 3 cycles if taken. Do you have different info or can you confirm this?
  • The timing will change when MAM is disabled. Then Flash memory access will require extra cycles. Old Harmony carts have a CPU which has a MAM bug and therefore it is disabled by default. Do you have any info on how this affects the cycle count? Edit: I just read that you are addressing this already. :thumbsup:
Edited by Thomas Jentzsch

10 minutes ago, Thomas Jentzsch said:

Have you been able to verify your counts? I have added cycle counts to Stella lately too and wonder how precise they can become. 

 

Issues:

  • I found no clear-cut details in the cycle count documentation about what happens in the case of a branch. I currently assume 1 cycle if not taken and 3 cycles if taken. Do you have different info or can you confirm this?
  • The timing will change when MAM is disabled. Then Flash memory access will require extra cycles. Old Harmony carts have a CPU which has a MAM bug and therefore it is disabled by default. Do you have any info on how this affects the cycle count? Edit: I just read that you are addressing this already. :thumbsup:

I'll summarise my understanding as best as I can. Apologies if you already know this.

 

There are four types of cycles: I cycles, S cycles, N cycles and C cycles. The fourth type, C cycles, can be ignored in our case.

 

I cycles are unaffected by memory speed and run at the ARM clock rate. N and S cycles can be "stretched" according to the underlying speed of the memory being addressed.

 

For Gopher2600, I've added the cycle profile for each of the 19 instruction groups. During execution of an instruction I make a count of the I, N and S cycles. Crucially, I count N and S cycles according to whether it was a PC fetch or a "data fetch" (this information is in the ARM7TDMI data sheet).

 

Once an instruction has completed I apply the appropriate stretching for the memory type. I make an assumption here that all data fetches are from SRAM and all PC fetches are from the memory area pointed to by the PC value at the end of the instruction. Both of these are reasonable assumptions I think.

 

If MAM is enabled I assume that the caching is "perfect" and all accesses occur at SRAM speed. How the MAM works is probably the area where the most improvement can be found but for now it seems to work okay.
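
The counting and stretching described above could be sketched in C roughly as follows. This is an illustration only, not the Gopher2600 implementation: the struct, field names and function name are hypothetical, and the stretch factors are placeholders to be supplied by the caller. It follows the stated assumptions that data accesses go to SRAM and that, with the MAM enabled, PC fetches run at SRAM speed.

```c
/* Hypothetical per-instruction tally; names are illustrative only. */
typedef struct {
    int i;       /* internal cycles, never stretched */
    int n_pc;    /* non-sequential cycles caused by instruction fetch */
    int s_pc;    /* sequential cycles caused by instruction fetch */
    int n_data;  /* non-sequential cycles caused by data access */
    int s_data;  /* sequential cycles caused by data access */
} cycle_tally;

/* Total time in ARM clock units after stretching.
 * pc_stretch:   factor for the memory area the PC points to
 * sram_stretch: factor for SRAM (all data accesses assumed to be SRAM)
 * mam_enabled:  nonzero means "perfect" caching, so PC fetches
 *               are charged at SRAM speed */
double stretched_cycles(cycle_tally t, double pc_stretch,
                        double sram_stretch, int mam_enabled)
{
    double fetch = mam_enabled ? sram_stretch : pc_stretch;
    return t.i
         + (t.n_pc + t.s_pc) * fetch
         + (t.n_data + t.s_data) * sram_stretch;
}
```

For example, an instruction with 1 I cycle plus 1 N and 2 S fetch cycles costs 13 clocks with a fetch stretch of 4.0 and the MAM off, and 4 clocks with the MAM on and an SRAM stretch of 1.0.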

 

On the subject of conditional branching: the ARM7TDMI data sheet says it takes two S cycles and one N cycle (both PC bound) so that's three cycles at the speed of the underlying memory pointed to by the PC. If the branch alters the PC from one memory area to the other then the cycles might stretch differently but I haven't bothered modelling that. (a) it's unlikely and (b) the 6502 probably wouldn't even notice the difference (unless it happens a lot).

 

There's nothing in the documentation that indicates cycle usage is different if the branch is successful or not.

 

 


21 minutes ago, Andrew Davie said:

Working well for me. Currently about 52 fps on my ARM-intensive project with all the CRT options disabled to get the extra bit of speed :)

MacBook early 2013, 2.6 GHz dual-core i5

Thanks for the updates!

Disabling CRT effects shouldn't make any difference. All the CRT processing is done on the GFX chip so assuming you have a GFX chip the performance impact is negligible. I'm interested if you're experiencing anything different.

 

Note that I have a GTX 650 in my development machine which is the same vintage as your MacBook Pro. What spec GFX chip does the Pro have?


11 minutes ago, JetSetIlly said:

Once an instruction has completed I apply the appropriate stretching for the memory type. I make an assumption here that all data fetches are from SRAM and all PC fetches are from the memory area pointed to by the PC value at the end of the instruction. Both of these are reasonable assumptions I think.

Agreed, valid assumptions. How are your stretchings defined? Integers?

 

Also, couldn't you apply the stretching later on and only once? That would save some execution time, no?

8 minutes ago, JetSetIlly said:

There's nothing in the documentation that indicates cycle usage is different if the branch is successful or not.

The doc says:

"When the condition code of any instruction is not met, the instruction is not executed. An unexecuted instruction takes one cycle."

 

In case of a conditional branch not taken IMO this implies that it takes only 1 cycle, no?


3 minutes ago, Thomas Jentzsch said:

Agreed, valid assumptions. How are your stretchings defined? Integers?

Floats.

 

3 minutes ago, Thomas Jentzsch said:

 

Also, couldn't you apply the stretching later on and only once? That would save some execution time, no?

 

 

The doc says:

"When the condition code of any instruction is not met, the instruction is not executed. An unexecuted instruction takes one cycle."

 

In case of a conditional branch not taken IMO this implies that it takes only 1 cycle, no?

 

We must have different documentation. This is from the ARM7TDMI-S technical reference manual https://developer.arm.com/documentation/ddi0234/b

image.thumb.png.615a13260a5160c89540d7c14d1447f9.png

 


Just now, Thomas Jentzsch said:

I have the same doc. The question is whether this is a conditional instruction. If it is, then it is not always executed and would take only one S cycle when not executed.

That makes sense and I think you're right. In ARM mode all instructions can be conditional but in Thumb mode it's only the branch which is conditional. I hadn't considered the possibility of conditionality and took that section at face value. Cheers.
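
The conclusion reached here can be sketched in C as below. The struct and function names are hypothetical; the cycle numbers are the ones discussed in this exchange (2S + 1N when the branch is taken, 1S when the condition fails).

```c
/* Hypothetical cycle profile for a Thumb conditional branch. */
typedef struct {
    int n;  /* non-sequential cycles */
    int s;  /* sequential cycles */
    int i;  /* internal cycles */
} branch_cycles;

branch_cycles cond_branch_cycles(int taken)
{
    branch_cycles c = {0, 0, 0};
    if (taken) {
        c.s = 2;  /* branch executed: 2S + 1N, all PC bound */
        c.n = 1;
    } else {
        c.s = 1;  /* condition not met: the instruction takes one cycle */
    }
    return c;
}
```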


57 minutes ago, JetSetIlly said:

Disabling CRT effects shouldn't make any difference. All the CRT processing is done on the GFX chip so assuming you have a GFX chip the performance impact is negligible. I'm interested if you're experiencing anything different.

 

Note that I have a GTX 650 in my development machine which is the same vintage as your MacBook Pro. What spec GFX chip does the Pro have?

 

Intel HD Graphics 4000 1536 MB

 

I was incorrect about the effects affecting performance. I turned them on and fps is still about the same.

That's good news! I guess my earlier tests were being affected by something else.

 

 

 


15 minutes ago, Thomas Jentzsch said:

You are welcome.

 

And another question: I notice that in case of a shift operation you only add the I-cycle if the shift is > 0. From which doc did you get this?

 

Hmm. Good question. If we look at the ARM equivalent instruction which is the MOV instruction, then the format of that instruction suggests that a shift happens when the shift bits are non-zero. If the bits are zero then a shift does not happen.

 

The cycle information for the MOV instruction meanwhile says that the I cycle isn't required unless there is a shift. A bit pattern of zero means no shift, so no I cycle is required.

 

Therefore, if we take at face value the equivalence of Thumb mode LSL/LSR and ARM mode MOV, then that says to me that LSL/LSR instructions with shift bits of zero do not require the additional I cycle.

 

I may be overthinking it and I'm prepared to be wrong, but that was my interpretation.
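
For reference, this interpretation could be written in C as below. It is only a sketch of the reading described above, which is unverified; the function names are hypothetical. It assumes the Thumb "move shifted register" encoding, where bits 10 to 6 of the opcode hold the 5-bit shift amount.

```c
#include <stdint.h>

/* Thumb "move shifted register" format: bits 10..6 are the shift amount. */
static unsigned shift_amount(uint16_t opcode)
{
    return (opcode >> 6) & 0x1Fu;
}

/* Under the interpretation above, the extra I cycle is only
 * charged when the shift bits are non-zero. */
int shift_i_cycles(uint16_t opcode)
{
    return shift_amount(opcode) != 0 ? 1 : 0;
}
```

So LSL r0, r1, #0 (0x0008) would be charged no I cycle, while LSL r0, r1, #4 (0x0108) would be charged one.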


4 minutes ago, Thomas Jentzsch said:

Looks like we have to speculate here. Or maybe do some testing on real hardware to find out.

 

The big question for me now is the access speed of SRAM and Flash memory as found in the Harmony. I've instrumented the settings so they can be changed on the fly but a good default is required.


1 hour ago, JetSetIlly said:

The big question for me now is the access speed of SRAM and Flash memory as found in the Harmony. I've instrumented the settings so they can be changed on the fly but a good default is required.

Do you already have full MAM emulation implemented?

 

I doubt that SRAM is slow at all. And it seems that with MAM = 2 Flash memory access speed becomes mostly irrelevant, S (always?) and N (mostly) are already in the latches. Only (far) branches may cause delays due to memory access. However with MAM=1 (N only) and especially with MAM = 0 (S and N), Flash memory access can play a major role.

 

I found a 50 ns access time for the LPC2103 Flash, which equals 4 cycles at 70 MHz. That will have a major impact given an average instruction time of ~2 cycles (based on opcode frequency sampling of some ARM games). The average increases to ~4 cycles (MAM = 1) and more than 7 cycles (MAM = 0).

 

These are worst case values, with Flash access only. 
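
The "50 ns at 70 MHz equals 4 cycles" figure is 3.5 clock periods rounded up to a whole cycle. A small C sketch of that arithmetic (the function name is made up for the example):

```c
/* Whole CPU clocks needed to cover a memory access time, rounded up.
 * access_ns * clock_mhz gives the access time in thousandths of a clock. */
unsigned access_clocks(unsigned access_ns, unsigned clock_mhz)
{
    return (access_ns * clock_mhz + 999u) / 1000u;
}
```

Here access_clocks(50, 70) gives 4, since 50 ns x 70 MHz = 3.5 clocks.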

Edited by Thomas Jentzsch

1 hour ago, Thomas Jentzsch said:

Do you already have full MAM emulation implemented?

 

Sort of. I've made the assumption that the caching is perfect and runs at the same speed as SRAM, whose access time, as you say, is negligible. No more than 10ns, I suspect.

 

Currently, you can have MAM turned on by default (meaning the driver has turned it on for you), or allow the Thumb program to turn it on (which, as I understand it, some versions of the Harmony do not allow). By default, I have the MAM active at the start of program execution, which is good for the very recent Champ games.

 

image.png.2375ef645384eb7eca59b896adc6a787.png

 

1 hour ago, Thomas Jentzsch said:

 

I doubt that SRAM is slow at all. And it seems that with MAM = 2 Flash memory access speed becomes mostly irrelevant, S (always?) and N (mostly) are already in the latches. Only (far) branches may cause delays due to memory access. However with MAM=1 (N only) and especially with MAM = 0 (S and N), Flash memory access can play a major role.

 

Are you sure that you can discount S cycles if MAM is active? I've not found good information about the MAM but I'm assuming that an S cycle would take 1 unstretched ARM cycle. Is it documented anywhere that you can ignore S cycles?

 


4 minutes ago, JetSetIlly said:

Are you sure that you can discount S cycles if MAM is active? I've not found good information about the MAM but I'm assuming that an S cycle would take 1 unstretched ARM cycle. Is it documented anywhere that you can ignore S cycles?

Sorry, I didn't mean you can completely ignore it. But it should be almost always cached by MAM, so no Flash penalties there. Of course it still takes 1 CPU cycle then.

 

BTW: The documentation suggests setting MAMTIM to 3 CCLKS (not 4 as I wrote above) for CPU speeds > 40 MHz.


23 hours ago, Thomas Jentzsch said:

Sorry, I didn't mean you can completely ignore it. But it should be almost always cached by MAM, so no Flash penalties there. Of course it still takes 1 CPU cycle then.

 

BTW: The documentation suggests setting MAMTIM to 3 CCLKS (not 4 as I wrote above) for CPU speeds > 40 MHz.

 

Can you tell me something about the MAM as it exists in the Harmony? What is the nature of the bug exactly?

 

Reading comments from @johnnywc, I know the latest drivers put the chip in MAM mode 2 by default, but what about earlier drivers? Do they initialise in MAM mode 0 or MAM mode 1?

 

In old Harmony cartridges, is it only mode 2 that you can't enter from the Thumb program or can you not enter mode 1 either?

 


9 minutes ago, Thomas Jentzsch said:

According to the errata sheet, the bug is (better: was) only in MAM 2. So I suppose MAM 1 is enabled. @batari should know best.

 

If it's enabled by default on entering the Thumb program then even that is a significant difference to MAM 0.

 

@batari if you could confirm that MAM 1 is set by the driver I would be very grateful.

 

Edited by JetSetIlly

50 minutes ago, JetSetIlly said:

If it's enabled by default on entering the Thumb program then even that is a significant difference to MAM 0.

Yes, it makes a big difference if an S-cycle takes 1.25 ((7*1 + 1*3)/8) or 3 cycles. 

 

My latest calculation (average cycles per instruction):

  • MAM 0: ~5.5 cycles
  • MAM 1: ~3.6 cycles
  • MAM 2: ~2.5 cycles 

The numbers are for LPC2103, which only has one flash bank. E.g. a LPC2104 has two interleaved banks, so especially sequential fetches will hardly have any latch misses.
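
The 1.25 figure above is a weighted average: out of 8 sequential fetches, 7 cost 1 cycle and 1 costs 3. A short C sketch of that calculation follows. The function name is hypothetical, and reading the 8 as "fetches per latch line" is an assumption on my part (eight 16-bit Thumb instructions in a 128-bit buffer).

```c
/* Average cost of a sequential fetch when one fetch per line misses
 * (miss_cycles) and the remaining per_line - 1 fetches hit (hit_cycles). */
double avg_fetch_cycles(unsigned per_line, unsigned hit_cycles,
                        unsigned miss_cycles)
{
    if (per_line == 0) {
        return (double)miss_cycles;  /* no latch: every fetch pays the miss cost */
    }
    return ((per_line - 1) * (double)hit_cycles + miss_cycles) / per_line;
}
```

With avg_fetch_cycles(8, 1, 3) this gives (7*1 + 1*3)/8 = 1.25, compared to 3 cycles for every S cycle in MAM 0.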

 

 

Edited by Thomas Jentzsch

2 hours ago, JetSetIlly said:

Reading comments from @johnnywc I know the latest drivers put the Chip in MAM mode 2 by default but what about earlier drivers? Does it initialise in MAM mode 0 or MAM mode 1?


All the drivers use MAM mode 2 for the driver itself for performance reasons. The driver code is running in RAM, which does not trigger the bug.

 

The game's C code runs from Flash memory (the "ROM"), so it can trigger the MAM bug with the older Melody/Harmony. The random crashes had become a major issue when working on Stay Frosty 2 and Frantic. I was almost ready to abort those projects when we figured out the issue.

 

From the errata sheet, no longer at the link but quoted here:

Quote

MAM.2: Under certain conditions in MAM Mode2 code execution out of internal Flash can fail
 
Introduction:
The MAM block maximizes the performance of the ARM processor when it is running code in Flash memory. It includes three 128-bit buffers called the Prefetch Buffer, the Branch Trail Buffer and the data buffer. It can operate in 3 modes; Mode 0 (MAM off), Mode 1 (MAM partially enabled) and Mode 2 (MAM fully enabled).
 
Problem:
Under certain conditions when the MAM is fully enabled (Mode 2) code execution from internal Flash can fail. The conditions under which the problem can occur is dependent on the code itself along with its positioning within the Flash memory.
 
Work-around:
If the above problem is encountered then Mode 2 should not be used. Instead, partially enable the MAM using Mode 1.

 

In DPC+ projects the game's code sets MAM 1 to prevent the crashing.  From Stay Frosty 2:

 

#define MAMCR *(unsigned char*)0xE01FC000

int main()
{
	MAMCR=1;    // partially disable cache due to hardware bug that can crash game

	switch (SUB)
	{
		// ... run function by 6507 code
	}

	MAMCR=2;    // reenable cache, speed needed for DPC+ driver and it doesn't trigger the bug
	return 0;
}

 

In BUS, CDF, and CDFJ the driver's code sets MAM to 1 before it calls main(), so the game's code no longer changes it.

 

.set  MAMBASE,    0xE01FC000      /* Memory Accelerator Module (MAM) address */

  /* Disable MAM (due to bug executing code from Flash with MAM enabled) */
  ldr     r2, =MAMBASE
  mov     r3, #1
  str     r3, [r2]

  /* Jump to user code */
  mov     lr, pc
  mov     pc, #CSTART

 

As of CDFJ+ the driver no longer sets MAM because the bug is not a factor in the newer boards.  I think, but am not sure, that @johnnywc has a one-off build of the CDFJ which does not change MAM to 1.
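
For emulator authors, the MAMCR writes in the two snippets above could be modelled along these lines. This is a hypothetical sketch in C: the state struct and function name are invented, the address is the one used in the code above, and ignoring a mode value of 3 is an assumption rather than documented behaviour.

```c
#include <stdint.h>

#define MAMCR_ADDR 0xE01FC000u  /* address used in the snippets above */

/* Hypothetical emulator state: just the current MAM mode (0, 1 or 2). */
typedef struct {
    unsigned mam_mode;
} arm_state;

/* Returns 1 if the store targeted MAMCR and was handled, 0 otherwise. */
int write_register(arm_state *st, uint32_t addr, uint32_t value)
{
    if (addr != MAMCR_ADDR) {
        return 0;
    }
    unsigned mode = value & 0x3u;  /* only the low two bits select the mode */
    if (mode <= 2) {
        st->mam_mode = mode;       /* assumption: a value of 3 is ignored */
    }
    return 1;
}
```

An emulator could then start a DPC+ game with mam_mode at 2 and let the game's own MAMCR=1 write lower it, or start BUS/CDF/CDFJ games with mam_mode already at 1.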
