StellaRT

Community-Built Unnamed 1970's Video Game Console-Compatible System (WIP)


Al_Nafuur


On 9/26/2023 at 9:58 PM, DirtyHairy said:

Well, the performance counter works. After building and inserting the kernel module that I linked I can run the following sample:

 

#include <iostream>
#include <cstdint>

using namespace std;

static inline uint64_t read_pmccntr(void) {
    uint64_t val;
    asm volatile("mrs %0, pmccntr_el0" : "=r"(val));
    return val;
}

int main() {
    uint64_t samples[32];

    for (auto& sample: samples) sample = read_pmccntr();

    for (int i = 0; i < 32; i++) {
        cout << samples[i];
        if (i > 0) cout << " : " << samples[i] - samples[i-1];

        cout << endl;
    }
}

 

and the output at "-O2" is

 

13553396869
13553396888 : 19
13553396901 : 13
13553396914 : 13
13553396920 : 6
13553396926 : 6
13553396932 : 6
13553396938 : 6
13553396944 : 6
13553396950 : 6
13553396956 : 6
13553396962 : 6
13553396968 : 6
13553396974 : 6
13553396980 : 6
13553396986 : 6
13553396992 : 6
13553396998 : 6
13553397004 : 6
13553397010 : 6
13553397016 : 6
13553397022 : 6
13553397028 : 6
13553397034 : 6
13553397040 : 6
13553397046 : 6
13553397052 : 6
13553397058 : 6
13553397064 : 6
13553397070 : 6
13553397076 : 6
13553397082 : 6

 

🎉

@Al_Nafuur This should do the trick.

The performance counter example now works on the Pi 3B+ with 64-bit PiOS:

output:

124784971079
124784971099 : 20
124784971112 : 13
124784971125 : 13
124784971138 : 13
124784971144 : 6
124784971150 : 6
124784971156 : 6
124784971162 : 6
124784971168 : 6
124784971174 : 6
124784971180 : 6
124784971186 : 6
124784971192 : 6
124784971198 : 6
124784971204 : 6
124784971210 : 6
124784971216 : 6
124784971222 : 6
124784971228 : 6
124784971234 : 6
124784971240 : 6
124784971246 : 6
124784971252 : 6
124784971258 : 6
124784971264 : 6
124784971270 : 6
124784971276 : 6
124784971282 : 6
124784971288 : 6
124784971294 : 6
124784971300 : 6

 

 

I have a small test program that determines the delay value from which reads from a simple ROM like Combat become stable:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <stdint.h>

#define BCM2708_PERI_BASE        0x3F000000
#define GPIO_BASE                (BCM2708_PERI_BASE + 0x200000) // GPIO controller

#define PAGE_SIZE (4*1024)
#define BLOCK_SIZE (4*1024)

int  mem_fd;
void *gpio_map;

// I/O access
volatile unsigned *gpio;


// GPIO setup macros. Always use INP_GPIO(x) before using OUT_GPIO(x) or SET_GPIO_ALT(x,y)
#define INP_GPIO(g) *(gpio+((g)/10)) &= ~(7<<(((g)%10)*3))
#define OUT_GPIO(g) *(gpio+((g)/10)) |=  (1<<(((g)%10)*3))
#define SET_GPIO_ALT(g,a) *(gpio+(((g)/10))) |= (((a)<=3?(a)+4:(a)==4?3:2)<<(((g)%10)*3))

#define GPIO_SET *(gpio+7)  // sets   bits which are 1 ignores bits which are 0
#define GPIO_CLR *(gpio+10) // clears bits which are 1 ignores bits which are 0

#define GET_GPIO(g) (*(gpio+13)&(1<<(g))) // 0 if LOW, (1<<g) if HIGH

#define GET_DATA_BUS() ((*(gpio+13)&0x1fe000)>>13) // GPIO 13 - 20 ( 0b000111111110000000000000 )

#define GPIO_PULL *(gpio+37) // Pull up/pull down
#define GPIO_PULLCLK0 *(gpio+38) // Pull up/pull down clock

void setup_io();

static inline uint64_t read_pmccntr(void) {
    uint64_t val;
    asm volatile("mrs %0, pmccntr_el0" : "=r"(val));
    return val;
}

int main(int argc, char **argv) {
  uint64_t t0, t1;
  char  pp[1];

  int g,rep, delay_counter = 1000;
  uint8_t a[0x200], b;

  // Set up gpio pointer for direct register access
  setup_io();

  // Set GPIO pins 0-12 to output (6502 address)
  for (g=0; g<=12; g++){
    INP_GPIO(g); // must use INP_GPIO before we can use OUT_GPIO
    OUT_GPIO(g);
  }

  // Set GPIO pin 21 to output (Level shifter dir)
  *(gpio+(2)) = 0b001000;

  // Set GPIO pin 21 to Low (ls dir read)
  GPIO_CLR = 1<<21;

  for (rep=0x000; rep<0x200; rep++){
    GPIO_CLR = 0b1111111111111;
    GPIO_SET = rep | 0x1000;
    g = 10000;
    while(g--){asm volatile("nop"); }

    a[rep] = (uint8_t)GET_DATA_BUS();

    if(rep < 0x10){
      printf("Read %d on address: %d \n", a[rep], rep );
    }
  }
//  scanf("Start %s", &pp);
start_the_test:
  for (g=0; g<1000; g++){
    for (rep=0x000; rep<0x200; rep++){
      GPIO_CLR = 0b1111111111111;
      GPIO_SET = rep | 0x1000;
      t0 = read_pmccntr();
      do{
        t1 = read_pmccntr() - t0;
      } while( t1 < delay_counter);
      b = (uint8_t)GET_DATA_BUS();
      if ( b != a[rep]){
        printf("Readings differ with delay_counter %d in round %d on address: %d  a: %d b: %d !\n", delay_counter, g, rep, a[rep], b );
        delay_counter += 100;
        goto start_the_test;
      }
    }
    printf(".");
  }
  printf("\nAll the Same!! delay_counter %d \n", delay_counter);
  return 0;
}

//
// Set up a memory region to access GPIO
//
void setup_io()
{
   // open /dev/mem
   if ((mem_fd = open("/dev/mem", O_RDWR|O_SYNC) ) < 0) {
      printf("can't open /dev/mem \n");
      exit(-1);
   }

   // mmap GPIO
   gpio_map = mmap(
      NULL,             //Any address in our space will do
      BLOCK_SIZE,       //Map length
      PROT_READ|PROT_WRITE,// Enable reading & writing to mapped memory
      MAP_SHARED,       //Shared with other processes
      mem_fd,           //File to map
      GPIO_BASE         //Offset to GPIO peripheral
   );

   close(mem_fd); //No need to keep mem_fd open after mmap

   if (gpio_map == MAP_FAILED) {
      printf("mmap error %ld\n", (long)gpio_map); // errno is also set
      exit(-1);
   }

   // Always use volatile pointer!
   gpio = (volatile unsigned *)gpio_map;


} // setup_io

 

According to this routine the readings are stable with a performance counter delay > 7000 (CPU cycles?). However, when I use the performance counter and this delay value in Stella, the emulation crashes really fast. The emulation gets somewhat stable with delays > 100,000, but it is extremely slow!

 


6 hours ago, Al_Nafuur said:

According to this routine the readings are stable with a performance counter delay > 7000 (CPU cycles?). However, when I use the performance counter and this delay value in Stella, the emulation crashes really fast. The emulation gets somewhat stable with delays > 100,000, but it is extremely slow!

Something is very wrong here. You are right, the performance counter measures CPU cycles. If you have set the PI to performance, then one 6502 cycle is roughly 1000 ARM cycles, so 7000 cycles is already too slow by a factor of 7. The bus should stabilise much faster, as it does on a real VCS. Maybe electrical issues?

 

On the Stella end I suspect a bug. Can you maybe push your code to a branch so I can have a look at it? Each read should look roughly like this:

 

1. Check how much time is left from the last cycle and spin until P <= (T_current - T_start)

2. Store the counter at the beginning of this cycle in a T_start

3. Write the address

4. Wait for a short delay until the bus stabilizes

5. Read and return the value

 

T_start = counter at cycle start, T_current = current counter, P = bus cycle length in ARM cycles

 

Emulation happens between 5 and 1, between one call to peek and the next.
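
A minimal sketch of the five steps above, in C. The names bus_t, timed_peek and spin_until and the sim_* hooks are hypothetical and not taken from Stella. On the Pi the hooks would presumably wrap pmccntr_el0 and the GPIO registers; the simulated versions here only let the sketch run on any host.

```c
#include <stdint.h>

/* Hypothetical hooks (not Stella's API): counter, address write, data read. */
typedef struct {
    uint64_t (*read_counter)(void);       /* free-running cycle counter */
    void     (*write_address)(uint16_t);  /* put the address on the bus */
    uint8_t  (*read_data)(void);          /* sample the data bus */
    uint64_t period;    /* P: bus cycle length in counter ticks */
    uint64_t settle;    /* short delay before sampling, in ticks */
    uint64_t t_start;   /* T_start of the previous bus cycle */
} bus_t;

/* Spin until at least `ticks` have elapsed since `start`.
   Unsigned subtraction stays correct if the counter wraps. */
void spin_until(const bus_t *bus, uint64_t start, uint64_t ticks) {
    while (bus->read_counter() - start < ticks) { }
}

uint8_t timed_peek(bus_t *bus, uint16_t address) {
    spin_until(bus, bus->t_start, bus->period);  /* 1. finish the last cycle */
    bus->t_start = bus->read_counter();          /* 2. store T_start */
    bus->write_address(address);                 /* 3. write the address */
    spin_until(bus, bus->t_start, bus->settle);  /* 4. let the bus stabilize */
    return bus->read_data();                     /* 5. read and return */
}

/* --- simulated backend, only so the sketch runs off-target --- */
static uint64_t sim_ticks;
static uint16_t sim_address;

uint64_t sim_counter(void) { return ++sim_ticks; }  /* one tick per read */
void sim_write_address(uint16_t address) { sim_address = address; }
uint8_t sim_read_data(void) { return (uint8_t)(sim_address & 0xff); }
```

With this layout the emulation work done between two calls eats into the spin in step 1 instead of adding to the cycle length, as long as it finishes within P.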

Edited by DirtyHairy

1 hour ago, DirtyHairy said:

Something is very wrong here. You are right, the performance counter measures CPU cycles. If you have set the PI to performance, then one 6502 cycle is roughly 1000 ARM cycles, so 7000 cycles is already too slow by a factor of 7. The bus should stabilise much faster, as it does on a real VCS. Maybe electrical issues?

Yes, ~1000 cycles was what I expected: at 1400 MHz a cycle is ~0.7 ns. The ~7000 in my test routine was very stable and I tested it multiple times. Something is very strange here. Stella even crashes with the very long delay > 100,000 when I let it run for a longer time.

 

1 hour ago, DirtyHairy said:

On the Stella end I suspect a bug. Can you maybe push your code to a branch so I can have a look at it? Each read should look roughly like this:

 

1. Check how much time is left from the last cycle and spin until P <= (T_current - T_start)

2. Store the counter at the beginning of this cycle in a T_start

3. Write the address

4. Wait for a short delay until the bus stabilizes

5. Read and return the value

 

T_start = counter at cycle start, T_current = current counter, P = bus cycle length in ARM cycles

 

Emulation happens between 5 and 1, between one call to peek and the next.

The reads and writes look a little different. We are only checking the time left when the last cycle included a write to the cartridge (which is also true for reads/writes from RIOT and TIA).

For a read cycle we are waiting the full ~700ns (like the 6502) before we read the data bus.
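
To put the numbers in this post side by side, here is a small sketch that converts between nanoseconds and ARM cycles. It assumes the 1400 MHz clock quoted above stays fixed (performance governor); the function names are made up for illustration.

```c
#include <stdint.h>

/* Convert a duration in nanoseconds to CPU cycles at a given clock (Hz).
   Plain integer math; fine for the small values discussed here. */
uint64_t ns_to_cycles(uint64_t ns, uint64_t cpu_hz) {
    return ns * cpu_hz / 1000000000ULL;
}

/* Convert a cycle count back to nanoseconds at a given clock (Hz). */
uint64_t cycles_to_ns(uint64_t cycles, uint64_t cpu_hz) {
    return cycles * 1000000000ULL / cpu_hz;
}
```

At 1400 MHz, ns_to_cycles(700, 1400000000) gives 980 cycles, while the 7000-cycle delay from the test routine corresponds to cycles_to_ns(7000, 1400000000) = 5000 ns, roughly seven times the ~700 ns read wait.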

 

I pushed my tests to the branch "feature/cartridgeportpmccntr":

https://github.com/Al-Nafuur/stella/tree/feature/cartridgeportpmccntr


9 hours ago, Al_Nafuur said:

For a read cycle we are waiting the full ~700ns (like the 6502) before we read the data bus.

Hm, my gut feeling is that this is too long to get full speed (doesn't leave enough time for emulation and for catching up with lost cycles), but anyway, that's not the issue here. From a brief look I can't see any issues in the code. You should definitely implement ::Reset though --- Stella first runs the emulation to detect the video mode, and then resets the system before running the actual emulation. Can't see any obvious issues with that here, though. Maybe a look at the generated assembly gives a hint.

 

I'll take a close look later this weekend when I find more time.


1 hour ago, DirtyHairy said:

Hm, my gut feeling is that this is too long to get full speed (doesn't leave enough time for emulation and for catching up with lost cycles)

This is also my feeling. Is it possible to let Stella do processing during this delay? With a delay of around 380, PlusCart and most original carts work stably. At this point they are marginally real time, with some periodic interruptions every second or so (anyone else noticed this?) At around 450 delay Harmony started working and also most original memory carts (such as SC,E7), and it's less than real time. I feel that the fixed delays don't leave enough time for the emulation to keep up, at least in real time.


27 minutes ago, MarcoJ said:

This is also my feeling. Is it possible to let Stella do processing during this delay?

We had discussed parallelization in the past, but it turned out to be really complicated if not impossible.

27 minutes ago, MarcoJ said:

With a delay of around 380, PlusCart and most original carts work stably. At this point they are marginally real time, with some periodic interruptions every second or so (anyone else noticed this?) At around 450 delay Harmony started working and also most original memory carts (such as SC,E7), and it's less than real time. I feel that the fixed delays don't leave enough time for the emulation to keep up, at least in real time.

Would we gain enough headroom if we eliminated the ghost peeks?

 

It should be pretty simple by changing M6502::peek(uInt16 address, Device::AccessFlags flags). Check the flags and if they are Device::NONE, ignore the peek. 

 

Note: If you have the debugger disabled, you have to reenable the Device:: definitions in M6502.m4/ins.
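
The idea can be modelled outside Stella roughly as follows. This is a standalone C sketch, not Stella code: access_flags, peek_stats and filtered_peek are hypothetical stand-ins, ACCESS_NONE plays the role of Device::NONE, and the rom array stands in for the slow, timed cartridge read. Whether the cartridge ever depends on the skipped reads is left open here.

```c
#include <stdint.h>

/* Hypothetical stand-ins, NOT Stella's real types. */
enum access_flags { ACCESS_NONE = 0, ACCESS_DATA = 1 << 0, ACCESS_CODE = 1 << 1 };

typedef struct {
    unsigned bus_reads;  /* peeks that went out to the cartridge port */
    unsigned skipped;    /* flagless peeks that were ignored */
} peek_stats;

/* Skip the slow bus access when the peek carries no access flags. */
uint8_t filtered_peek(peek_stats *stats, const uint8_t *rom,
                      uint16_t address, unsigned flags) {
    if (flags == ACCESS_NONE) {
        stats->skipped++;
        return 0;  /* assumption: the caller discards this value */
    }
    stats->bus_reads++;
    return rom[address & 0x0fff];  /* stand-in for the timed cartridge read */
}
```

Comparing bus_reads with skipped over a frame would give a first estimate of how much headroom the change buys.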


52 minutes ago, MarcoJ said:

This is also my feeling. Is it possible to let Stella do processing during this delay? With a delay of around 380, PlusCart and most original carts work stably. At this point they are marginally real time, with some periodic interruptions every second or so (anyone else noticed this?) At around 450 delay Harmony started working and also most original memory carts (such as SC,E7), and it's less than real time. I feel that the fixed delays don't leave enough time for the emulation to keep up, at least in real time.

This is the NOP count delay, which makes the write cycles longer than they need to be. That's why we want to switch to a real timer or CPU cycle counter. But I have the gut feeling that these are not reliable on the Pi, or at least not in the way we are trying to read them from the CPU.

 

I will modify my test routine to let it run for hours with a fixed value of ~8000 (Pi CPU cycles?? !!) and test for bad reads from the ROM.


I don't doubt the counter or the timers, I think there is something fundamentally wrong that we are overlooking --- there is too much weird behaviour that does not line up for me. Could you do a test? For a well-known ROM (say Combat) log each peek and poke in an array, and write that to disk after about 1000000 entries. For each peek / poke the log should contain the values of the performance counter at entry and at exit, the address, the type (peek/poke) and the value read (if applicable), preferably in text format, one line per peek/poke. This should give us some insight into what is happening.

 

Just take care to reset the log in ::reset, otherwise the entries from TV standard detection and actual emulation will mix 😏
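
A possible shape for such a log, sketched in C. All names (bus_log, bus_log_add and so on) and the exact line format are assumptions for illustration. The buffer is allocated once so that logging itself does not allocate inside peek/poke.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* One bus access; the fields follow the list in the post. */
typedef struct {
    uint64_t t_entry;   /* performance counter at entry */
    uint64_t t_exit;    /* performance counter at exit */
    uint16_t address;
    uint8_t  value;     /* value read or written */
    char     type;      /* 'R' = peek, 'W' = poke */
} bus_log_entry;

typedef struct {
    bus_log_entry *entries;
    size_t count;
    size_t capacity;
} bus_log;

/* Allocate everything up front; returns 0 on success, -1 on failure. */
int bus_log_init(bus_log *log, size_t capacity) {
    log->entries = malloc(capacity * sizeof *log->entries);
    log->count = 0;
    log->capacity = log->entries ? capacity : 0;
    return log->entries ? 0 : -1;
}

/* Call from the reset hook so detection and emulation entries don't mix. */
void bus_log_reset(bus_log *log) { log->count = 0; }

/* Returns 1 if the entry was stored, 0 once the log is full. */
int bus_log_add(bus_log *log, char type, uint16_t address, uint8_t value,
                uint64_t t_entry, uint64_t t_exit) {
    if (log->count >= log->capacity) return 0;
    bus_log_entry *e = &log->entries[log->count++];
    e->t_entry = t_entry;
    e->t_exit = t_exit;
    e->address = address;
    e->value = value;
    e->type = type;
    return 1;
}

/* Write one text line per access; returns the number of lines written. */
size_t bus_log_write(const bus_log *log, FILE *out) {
    for (size_t i = 0; i < log->count; i++) {
        const bus_log_entry *e = &log->entries[i];
        fprintf(out, "%" PRIu64 " %" PRIu64 " %c %04x %02x\n",
                e->t_entry, e->t_exit, e->type,
                (unsigned)e->address, (unsigned)e->value);
    }
    return log->count;
}

void bus_log_free(bus_log *log) {
    free(log->entries);
    log->entries = NULL;
    log->count = log->capacity = 0;
}
```

A peek that took counter values 10 to 25 and read 0xf0 from 0x1ffc would come out as the line "10 25 R 1ffc f0".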

Edited by DirtyHairy

2 hours ago, Thomas Jentzsch said:

We had discussed parallelization in the past, but it turned out to be really complicated if not impossible.

I was thinking, as a thought experiment, what happens if normal Stella gets nop delays added to its read/write cycles? Would similar things happen where emulation struggles to keep up in real time? 


Anybody touched upon how underwhelming the Pi 5 upgrade is yet? At least that's what I heard from some ARM nerds.

I would say with my experience with the Pi 400, I'm no longer a fan of the Raspberry Pi brand. I would recommend getting a Rockchip-based single-board computer or something for this project. Otherwise, we might as well be potentially generating more trash.


1 hour ago, MarcoJ said:

I was thinking, as a thought experiment, what happens if normal Stella gets nop delays added to its read/write cycles? Would similar things happen where emulation struggles to keep up in real time? 

The effect would be the same.

 

After more digging, it seems like we can fully disable scheduling on the emulation core after all. Specifically, adding "isolcpus=domain,nohz,managed_irqs=3" to the kernel command line should fully disable scheduling and interrupts on the last core, and only threads with explicit affinity will schedule there --- this will give us full ownership of the core. Furthermore, "echo -1 > /proc/sys/kernel/sched_rt_runtime_us" allows realtime threads to run indefinitely without interruption (although I am not sure whether this is required with isolcpus). A good reference is https://canonical.com/blog/real-time-kernel-tuning .

 

However, we should first identify and understand the current issue with the performance counter, I don't think it is related to scheduling.

Edited by DirtyHairy

7 hours ago, Al_Nafuur said:

I will modify my test routine to let it run for hours with a fixed value of ~8000 (Pi CPU cycles?? !!) and test for bad reads from the ROM.

I think the CPU is fooling us about these timers/cycle counters! No matter what delay I choose (8000 - 20000), I can't get my test routine to run for very long. Sooner or later a read fails! The longest run was about 20 minutes, but usually it fails in the first 4 minutes.

I am currently testing the same hardware setup with the NOP delay routine and it is running for nearly 2 hours now (without any failed read!)


7 minutes ago, Thomas Jentzsch said:

Would the timer and the NOP delay react differently if they are interrupted? 

Not in my test program. Interrupts from the OS (or from the debugger in Stella) don't result in read failures. AFAIK in our setup read failures can only occur when we are reading the data bus too early, before the ROM has set it.

 


5 hours ago, DirtyHairy said:

Furthermore, "echo -1 > /proc/sys/kernel/sched_rt_runtime_us" allows realtime threads to run indefinitely without interruption (although I am not sure whether this is required with isolcpus).

Tried these two commands out on my pi4 rig. While Stella was already running, I ran the commands through an XRDP session.

 

isolcpus=domain,nohz,managed_irqs=3

Result: This didn't appear to change the performance at the time. (But it did help the second command - see later).

 

echo -1 > /proc/sys/kernel/sched_rt_runtime_us

Result: This dramatically reduced the lag that occurs every second or so. I have noticed this lag on both 32bit and 64bit OS. 

 

Following this I rebooted the Pi and, with Stella running, entered only the "echo -1" command through the same XRDP session. It produced the same dramatic lag reduction without running "isolcpus" first.

 

I then tried entering the CPU commands before Stella loaded. After Stella loaded it froze. However, it worked fine if Stella was already running and I added the commands in parallel using XRDP.

 


Shown are two videos of Berzerk guy running normally, and with the scheduler disabled (echo -1 > /proc/sys/kernel/sched_rt_runtime_us). In the first video his running gets paused every second or so. In the second video his running is smooth.

 

 

1. Normal (performance CPU governor).

 

 

 

2. Scheduler disabled (with performance CPU governor)

 

 

 

 


1 hour ago, MarcoJ said:

Tried these two commands out on my pi4 rig. While Stella was already running, I ran the commands through an XRDP session.

Note: exiting or restarting Stella with these settings causes Stella to freeze and require the "killall -9" treatment. Entering the debugger does this too. The "echo -1" command does help reduce lag a lot but also creates crashing issues. It might need a more sophisticated approach to apply.


The effect of the "echo -1" command on the emulation performance is more pronounced in this illustration.

 

1. Normal (performance CPU)

 

 

2. Scheduler disabled (with performance CPU)

 

 

In the first video, the sound has stutters and other artefacts, and is overall slower. In the second video, the sound is smooth and runs close(r) to real time.

 


1 hour ago, MarcoJ said:

Result: This dramatically reduced the lag that occurs every second or so. I have noticed this lag on both 32bit and 64bit OS. 

Awesome! The lag comes from the scheduler periodically interrupting the emulation thread. Changing /proc/sys/kernel/sched_rt_runtime_us will disable that.

1 hour ago, MarcoJ said:

Result: This didn't appear to change the performance at the time.

😏 It's not a command, but a kernel parameter. You have to append it to the kernel command line (cmdline.txt) on the boot partition. I just doublechecked and tested myself, you have to append "isolcpus=domain,managed_irq,3" . The nohz part would also be useful, but we'd need to rebuild the kernel to support that. Anyway, this will reserve the fourth core for Stella to claim, and no other threads or processes will be scheduled there.
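
Once the core is isolated, a thread has to ask for it explicitly. Here is a hedged sketch of how that could look in C on Linux, using sched_setaffinity and sched_setscheduler. pin_to_core and make_realtime are hypothetical helper names, core 3 is the value from the kernel parameter above, and the SCHED_FIFO request normally needs root or CAP_SYS_NICE. This is not Stella's actual thread handling.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* for cpu_set_t, CPU_SET and sched_setaffinity */
#endif
#include <sched.h>
#include <string.h>

/* Restrict the calling thread to a single core.
   Returns 0 on success, -1 on a bad core number or a failed syscall. */
int pin_to_core(int core) {
    cpu_set_t set;
    if (core < 0 || core >= CPU_SETSIZE) return -1;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set);  /* 0 = calling thread */
}

/* Switch the calling thread to the SCHED_FIFO realtime policy.
   Returns 0 on success, -1 on failure (typically missing privileges). */
int make_realtime(int priority) {
    struct sched_param param;
    memset(&param, 0, sizeof(param));
    param.sched_priority = priority;
    return sched_setscheduler(0, SCHED_FIFO, &param);
}
```

The emulation thread would call pin_to_core(3) and then make_realtime with a valid SCHED_FIFO priority once at startup, and check both return values.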

29 minutes ago, MarcoJ said:

It might need a more sophisticated approach to apply.

Nah, I think those are just bugs in my thread handling. I tried to make sure that the emulation thread gives up control and yields if Stella quits emulation mode, but it seems I missed something. I am pretty sure that I can debug and solve this when I find time.


2 hours ago, Al_Nafuur said:

I think the CPU is fooling us about these timers/cycle counters!

I'm not convinced 😏 The performance counters are part of the ARM architecture spec, nothing PI specific, and they should work fine. Maybe we are using them wrong, maybe there is something we overlooked, and maybe there is something else wrong altogether. We'll only find out by systematic debugging. What would help:

  • Test the performance counter against a well-known clock source
  • Generate a square wave on a GPIO pin by using a loop against the counter, and check that with a scope.

Both are on my list, but not today (and probably not tomorrow).
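
The first check could look roughly like the following C sketch, which compares a counter against CLOCK_MONOTONIC over a busy-wait interval. The names counter_fn, monotonic_ns and measure_counter_hz are assumptions. On the Pi one would pass a wrapper around read_pmccntr from earlier in the thread; monotonic_ns is only a portable stand-in counter for trying the code on another host.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime */
#endif
#include <stdint.h>
#include <time.h>

typedef uint64_t (*counter_fn)(void);

/* Reference clock in nanoseconds; also usable as a stand-in counter. */
uint64_t monotonic_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Busy-wait for interval_ns on the reference clock and return the
   counter's apparent frequency in Hz over that interval. */
double measure_counter_hz(counter_fn read_counter, uint64_t interval_ns) {
    uint64_t t0 = monotonic_ns();
    uint64_t c0 = read_counter();
    uint64_t t1, c1;
    do {
        t1 = monotonic_ns();
    } while (t1 - t0 < interval_ns);
    c1 = read_counter();
    return (double)(c1 - c0) * 1e9 / (double)(t1 - t0);
}
```

If the cycle counter behaves, repeated measurements over, say, 100 ms should all land close to the CPU clock (about 1.4e9 on a Pi 3B+ at full speed); large jumps between runs would point at frequency scaling or at the counter being stopped or reset.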


2 minutes ago, MarcoJ said:

Question, will/does it make a difference if the "Multithreaded" option is ticked in Stella?

Ho-hum, good question. It may be that those threads are dispatched from the emu thread, in which case they would inherit the affinity and RT priority and will start competing with the emu thread for core 3. They should cause a lockup, though. Do you see a difference if you tick that option?

 

Oh, btw, does only the debugger freeze Stella, or does the menu freeze it, too?


4 minutes ago, DirtyHairy said:

Oh, btw, does only the debugger freeze Stella, or does the menu freeze it, too?

Apologies, just went out. Will test in a few hours.

 

3 minutes ago, DirtyHairy said:

That looks pretty much awesome to me, much better than I would have initially hoped for.

Excellent. The echo -1 makes a big difference.


6 hours ago, DirtyHairy said:

Oh, btw, does only the debugger freeze Stella, or does the menu freeze it, too?

OK, did this test: checking whether Stella froze in various places after entering the command "echo -1 > /proc/sys/kernel/sched_rt_runtime_us". Tried with Multi-threaded ticked, and also unticked. Stella is OK with going into the "Options" dialog with the tab button either way; the game recovers OK when execution is restarted. Debugger entry freezes Stella to a black canvas in both cases. When in the black canvas state, I could only get Stella to end by issuing a killall -9 command from another bash window. After the killall command, Stella sometimes restarts successfully from the command line but sometimes freezes to a black canvas.

 

 

 
