
DirtyHairy

Members
  • Posts

    873

Everything posted by DirtyHairy

  1. https://github.com/jerinjacobk/armv8_pmu_cycle_counter_el0 That's a kernel module that enables this counter (and a few others) and grants access to unprivileged code (on aarch64). The relevant register is pmccntr_el0. You can find more information in the ARM v8 reference manual, section D12 (performance monitor extensions). The following code for reading the counter is quoted from the readme of this module:

         static inline uint64_t read_pmccntr(void)
         {
             uint64_t val;
             asm volatile("mrs %0, pmccntr_el0" : "=r"(val));
             return val;
         }
  2. To iterate on that, I think I have found a good solution --- the ARM performance counter. Once enabled it counts CPU cycles (so sub-nanosecond, and at a fixed rate if scaling is disabled), and reading it is a single MRS instruction to access a special register. The access is usually blocked from userspace, but access for unprivileged code can be enabled from within a small kernel module that configures the relevant special registers.
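A minimal sketch of how such a counter could be used to enforce a lower bound on the bus cycle time. Everything here is an assumption for illustration: `spinUntilElapsed` and `readSteadyNanos` are hypothetical names, and the counter source is passed in as a callable so that a read of pmccntr_el0 could be swapped in on aarch64 once access is enabled. The portable stand-in below uses std::chrono::steady_clock instead.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: busy-waits until a monotonic counter has advanced by at
// least `minTicks` since `start`, and returns the counter value it stopped at.
// `readCounter` is any callable that returns an increasing uint64_t.
template <typename ReadCounter>
uint64_t spinUntilElapsed(ReadCounter&& readCounter, uint64_t start, uint64_t minTicks) {
  uint64_t now = readCounter();
  // Unsigned subtraction stays correct across a single counter wrap-around.
  while (now - start < minTicks) now = readCounter();
  return now;
}

// Portable stand-in counter (nanoseconds from steady_clock). This is an
// assumption for the sketch, not the ARM cycle counter itself.
inline uint64_t readSteadyNanos() {
  using namespace std::chrono;
  return static_cast<uint64_t>(
      duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count());
}
```

The bus code would note the counter value at the start of a cycle and call `spinUntilElapsed` with the minimum cycle length before putting the next value on the bus. Preemption can still make a cycle longer, but never shorter.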
  3. @Al_Nafuur @MarcoJ @Kroko and anybody else involved 😏 How do we plan to continue? I see several possible things to try next:

       1. Improve rtstella: make emulation cycles more deterministic by eliminating caching and batching optimisations. Originally, that was my next step.
       2. Classify reads into relevant and irrelevant reads and use that to eliminate bad writes due to dummy reads to the same address before writing (as in absolute indirect addressing).
       3. Try to find a working hardware timer and use that to get a lower bound on cycle time.
       4. Use a RT thread on another core to get a monotonic counter and use that to get a reliable lower bound on cycle time.
       5. Use an external circuit to make sure that there is sufficient delay before putting the value on the bus.
       6. ...?

     1., 2., 3. and 4. are pure software tasks, and 2. and 4. should improve compatibility. I could work on any of them, but I don't know what your plans are, and I cannot test myself (I still have to order the necessary components). I would prefer 3 and 4, but what are your preferences?
  4. Nice, interesting. The speed seems to be good, but the pitch of the oscillator seems to be seriously off. According to Kevtris' reverse engineering this is a simple oscillator circuit with a cap and resistor. It seems the way the cart is wired up influences this circuit.
  5. Neither have I, but I agree Deadline is worth a look 😏. I also have preempt_rt on my list. Still, I think we basically have to (be able to) live with preemption.
  6. Yeah, I am using FIFO at highest priority for rtstella. But still, all of these can and will be preempted.
  7. Every 6502 cycle is a bus access. The only ones that we can likely skip if we aim for high accuracy and compatibility are multiple redundant reads to the same address. Ignoring this small number of cycles that we can safely skip on the bus, we have to wait for the end of each cycle before we can do the next, so emulation essentially has to happen in lockstep with the VCS clock. Let's say the VCS' clock is 1MHz for simplicity. Then each bus cycle is 1us. If each bus cycle takes 1us, then we can emulate at most 1000000 cycles per second, which is exactly what we require for full speed. If we lose any of those cycles (and we will due to scheduling) we have absolutely no time left to catch up again, and the emulation will effectively be slower than real time. So we need some leeway to catch up with those cycles missed by scheduling. The only deterministic way of getting that leeway is shortening the bus cycles. We can also try to get away with ignoring more cycles on the bus, but this comes at the price of reduced compatibility, and the mileage will vary from game to game.
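The arithmetic above can be written down as a small sketch. The function names and the idea of expressing scheduler losses as a single fraction of wall time are assumptions made for illustration; the 1MHz figure is the simplified clock from the post.

```cpp
// Cycles per second that a lockstep emulation can sustain when every bus cycle
// takes `busCycleSeconds` and a fraction `lostFraction` of wall time is lost
// to the scheduler. Both parameters are illustrative assumptions.
inline double sustainedCyclesPerSecond(double busCycleSeconds, double lostFraction) {
  return (1.0 - lostFraction) / busCycleSeconds;
}

// Longest bus cycle that still reaches `targetHz` under the same loss.
inline double maxBusCycleSeconds(double targetHz, double lostFraction) {
  return (1.0 - lostFraction) / targetHz;
}
```

With 1us bus cycles and no loss this gives exactly 1000000 cycles per second. If, say, 5% of the time were lost to scheduling, the same 1us cycles would only sustain 950000 cycles per second, and the bus cycle would have to shrink to 0.95us to get back to full speed.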
  8. Good point. It might be worth a try to spawn a realtime thread that continuously increments a shared variable. We can use std::atomic to make sure that the increment is atomic and that the proper barriers are inserted.
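A minimal sketch of such a counter thread, assuming a hypothetical `SpinCounter` class. Realtime priority and pinning the thread to its own core are left out here; on Linux they would need something like pthread_setschedparam with SCHED_FIFO and sufficient privileges. Whether the atomic increment compiles to a single instruction depends on the target and compiler flags.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Hypothetical free-running counter: a dedicated thread increments a shared
// atomic as fast as it can; other threads read it as a monotonic tick source.
class SpinCounter {
public:
  SpinCounter() : worker_([this] { run(); }) {}

  ~SpinCounter() {
    stop_.store(true, std::memory_order_relaxed);
    worker_.join();
  }

  SpinCounter(const SpinCounter&) = delete;
  SpinCounter& operator=(const SpinCounter&) = delete;

  // Current tick count; never decreases between two calls.
  uint64_t read() const { return ticks_.load(std::memory_order_acquire); }

private:
  void run() {
    while (!stop_.load(std::memory_order_relaxed))
      ticks_.fetch_add(1, std::memory_order_release);
  }

  std::atomic<uint64_t> ticks_{0};
  std::atomic<bool> stop_{false};
  std::thread worker_;  // declared last so the atomics exist before the thread starts
};
```

The ticks have no fixed relation to wall time, so the tick rate would have to be calibrated against a real clock before it can serve as a lower bound on cycle time.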
  9. Awesome! Sound seems OK, too, that's a good start. The flash carts may be harder to get running. We have to make the bus cycles shorter than a real VCS in order to keep up, and at least some Uno drivers barely fit into the ordinary cycle budget. I guess if you increase the 300 you'll get them running, but at slower-than-realtime speed.
  10. Small correction: it should be "make -j2 all". And no "sudo" --- none of the build steps needs to be run as root 😏
  11. 🎉 Are you using rtstella? How is performance? Is audio OK (can only be if it runs at 100% speed), do you get underruns?
  12. No need to rebuild anything, SDL2 can use ALSA (and even should use it) directly if PA is not available. You can force it by doing "export SDL_AUDIODRIVER=alsa" before running Stella.
  13. I see no alternative, other than adding an auxiliary chip with >40 GPIOs that executes those phantom reads for us 🤷‍♂️ But I cannot think of any case where consecutive reads to a single ROM location would make a difference compared to prolonging the previous cycle and a single, final read. The issue, on the other hand, affects all ROM locations that are written to, and limiting us to well-known hotspots would reduce compatibility.
  14. I have done a few experiments with Dig Dug, Aardvark and 6502.ts. Double access to a SC write location is *extremely* common --- it happens on pretty much every indexed absolute access. The answer is what I described above, but I made a mistake: it doesn't require the offset to be zero; any offset that does not cross a page boundary will do! Dig Dug is full of such accesses, the first happening at 0x1110 in bank 3. I guess many other games use indexed absolute addressing for the SC RAM as well.

      As we cannot guarantee short cycle times due to the scheduler (even with rtstella, which is not what is currently in @Al_Nafuur's git), this will fail with the Kroko, and maybe also with real hardware, depending on when it latches the value into RAM. I don't see why the Uno/Plus should care, though.

      This could be worked around in software by labelling memory accesses as irrelevant or relevant. The cart code could batch up consecutive irrelevant reads to the same address in ROM space (emulation can continue as the result is irrelevant) and only play out the last access. This is a slight departure from exact emulation, but I can't see how it would cause any fallout.
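The workaround could look roughly like the sketch below. This is not code from Stella or rtstella: `BusAccess` and `coalesceDummyReads` are hypothetical names, the `relevant` flag is assumed to be supplied by the CPU core, and the sketch works on a recorded list of cycles for clarity, whereas real cart code would have to defer a pending irrelevant read until the next access is known.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of one 6502 bus cycle as seen by the cart code.
struct BusAccess {
  uint16_t address;
  bool isWrite;
  bool relevant;  // false for dummy reads whose result the CPU discards
};

// Returns the accesses that would actually be played out on the bus. An
// irrelevant read in cartridge space (A12 set) is dropped when the next
// cycle targets the same address, so only the last access of a run remains.
std::vector<BusAccess> coalesceDummyReads(const std::vector<BusAccess>& cycles) {
  std::vector<BusAccess> played;
  for (std::size_t i = 0; i < cycles.size(); ++i) {
    const BusAccess& current = cycles[i];
    const bool inCartSpace = (current.address & 0x1000) != 0;
    const bool droppable = inCartSpace && !current.isWrite && !current.relevant;
    const bool sameAddressNext =
        i + 1 < cycles.size() && cycles[i + 1].address == current.address;
    if (droppable && sameAddressNext) continue;
    played.push_back(current);
  }
  return played;
}
```

For an indexed absolute store that does not cross a page, the dummy read and the write hit the same address, so only the write would reach the cart and the length of the dummy cycle no longer matters.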
  15. I checked the 6502.ts source, and indexed writes have the same behaviour: read lo, read hi, dummy read (ignoring a potential carry from a page cross), write (using the correct address). So, if the index is zero, two consecutive cycles with the same address. The Uno / Plus shouldn't care as it uses the last value on the bus, but the Kroko will fail whenever the first cycle takes too long. Dunno how real SC cartridges handle the absence of a R/W line, but maybe someone else has disassembled one and knows 😏
  16. Good catch. And, if the index is zero, this will generate two consecutive accesses to the same SC write address. I think this may not even be uncommon in loops that copy or initialise memory.
  17. On the other hand, if the Kroko has code that explicitly handles this case, then maybe there are other, more legitimate instructions that I don't remember off the top of my head that do two consecutive accesses to SC RAM. Can't think of any though. EDIT: and actually, thinking about it, for the cart RMW is even three consecutive accesses to the same address 😏
  18. @Al_Nafuur If it works correctly then you should rebase the branch in your repo on my "rtstella" branch and adapt the build instructions (configure with "--host=rtstella", "make clean" before building, run without "nice -20", check the output on the console for "rtstella main loop"). You can also modify configure on your branch to build rtstella by default. Atm @Kroko is not running rtstella, but the normal scheduling loop.

      I think this is for RMW instructions. These perform two consecutive writes to the same address. The first write writes the operand value, the second one the actual result. The address bus does not change in between, but you need to read the correct result from cycle 2. Actually, this is an issue: if Stella takes too long before it puts the second value on the bus, the cart will read the wrong value. This is one case in which the assumption that the cart does not care about longer cycles breaks. Although, on second thought, I can't imagine why anyone would do a RMW on SC RAM, short of a bug.
  19. Exactly, I think they are so similar that the difference is irrelevant --- the precise timing is established in Stella's dispatch loop.

      Definitely possible, but you'd need to write your own drivers for audio and video, which would be a major undertaking. And in order to get all the shiny bells and whistles offered by Stella (high quality scaling, scanlines, phosphor, vsync etc.) in realtime you'd either have to try and squeeze it from the CPU or write a GPU driver, too. I don't even want to think about the amount of work involved 😏 An alternative would be an existing RTOS --- I checked for one that runs on the Pi, but I couldn't find any that is open and has an accelerated video stack (or audio for that matter).
  20. I don't think we can use any interrupts from userspace, we'd have to use a kernel driver for that --- we have to poll. That's imo no big deal as there isn't much we could do while waiting for the interrupt anyway (unless we speculate on the result of the read). However, NTSC or PAL frequencies are sorta-irrelevant as we need to drive the bus faster than the real VCS does: emulation works in lockstep with the bus, and if we don't squeeze out some additional time budget by shortening the cycles we don't have any time left to account for time lost due to the OS scheduler (Linux will *always* context switch, we can only try to reduce the number of context switches as much as possible).
  21. To give context for those who don't know the previous conversation: I have started a Stella branch introducing a "rtstella" target that tries to address the scheduling issues I previously pointed out. It replaces (almost all) mutexes with spinlocks and runs the emulation core on a thread with realtime priority that never voluntarily yields control to the scheduler. The thread runs continuously and is only briefly put into a busy-wait loop while the main thread handles events and copies out the next frame for rendering. @Al_Nafuur has retested his bit banging code with rtstella, and cartridges now run at full speed, provided the bit banging code uses a delay that is sufficiently precise and small (using a NOP-loop).

      Now @Al_Nafuur, to address your last message 😃 I am very positively surprised that the first attempt already results in full speed. The fact that you need to rely on a sufficiently short delay is not surprising. The full time for a bus cycle (delay + emulation + GPIO access) mustn't be longer than 1/1.18MHz, otherwise you cannot ever get full speed. I suspect that the other delay mechanisms are too coarse and generate delays that are too long for this to work out. In fact, this also may explain why normal Stella gets even slower when you switch to the NOP loop: if the cycle time is too long, Stella will not sleep between timeslices (it is too slow anyway), resulting in fewer preemption points. Once you start staying within the time budget it will start sleeping between timeslices, and you lose more time to the scheduler.

      I am not surprised that the PlusCart requires longer delays; the code for some banking schemes is pretty tight on the time budget that you have between real 2600 bus cycles. DPC is particularly bad, supercharger too.

      As for the actual delay mechanism, imo an external timer that raises a GPIO once it expires is a good idea. The only thing to keep in mind is that GPIO access comes with an intrinsic delay, too, so there is a limit on the resolution that could be achieved this way. I saw benchmarks on the web that I can't seem to locate just now, something like 80MHz I think. The internal timer in the Broadcom would do, too, but it is very possible that Linux is already using and managing it, which may be why you couldn't get it to behave as expected. We'd either have to use a timer that is unused by the OS, or explicitly deactivate it in the kernel, which may or may not be possible.

      Regarding rtstella: the next thing I want to do is to remove various batching optimisations that we do (which batch TIA and RIOT cycles and execute them in bigger chunks). Usually those are an improvement as they improve cache locality and reduce the time spent on loops, but in our case I think removing them will give a better and more predictable spread of processing times over the individual bus cycles. After that, we should try to get audio working properly. It is already working fine with some popping and underruns on my end, but that still is in an ARM64 Ubuntu VM on an M1 Mac; the real thing will likely behave differently. After that: buildroot 😏
  22. No, the only relevant change is the fixed supercharger BIOS stub. What tweaks?
  23. This update is for the original UnoCart firmware, not for the UCA firmware.
  24. Wrong place or not, thanks for the feedback 😏