
Community-Built Unnamed 1970's Video Game Console-Compatible System (WIP)


Al_Nafuur


9 hours ago, MarcoJ said:

I have also found this. Some more complex schemes are on the knife's edge of working. Especially during the beginning transition of the cycle, delaying the presentation of address can be fatal.

If Harmony handles this with just 60 MHz, why can't the UnoCart handle this easily? 

 

 


4 hours ago, Thomas Jentzsch said:

If Harmony handles this with just 60 MHz, why can't the UnoCart handle this easily? 

The Uno firmware is written in C and runs from flash, while the Harmony (to my knowledge) is hand-optimised assembly running from RAM.

Edited by DirtyHairy

Tonight I've been reading the Stella source. Where do the calls to the Cart class functions peek and poke originate in general? I'm a bit lost trying to find the main loop. I see a main loop in OSystem.cxx and OSystemRTStella.cxx, but I'm not sure where it translates into peeks and pokes.


5 hours ago, MarcoJ said:

Tonight I've been reading the Stella source. Where do the calls to the Cart class functions peek and poke originate in general? I'm a bit lost trying to find the main loop. I see a main loop in OSystem.cxx and OSystemRTStella.cxx, but I'm not sure where it translates into peeks and pokes.

That happens in M6502::execute(). The peeks are in M6502.ins, which is included in the M6502 class.


On 10/6/2023 at 1:15 PM, MarcoJ said:

@Al_Nafuur are you planning to write in AVR assembly or c?

 

I'm starting with assembly. It looks like there are quite a few single-clock instructions to form the delay. Register-wise, there's PORTB for the pin levels and DDRB for the data direction (both 6 bits).

 

I have finally managed to set up my "development environment" for the ATtiny:

[Attached image: IMG_20231010_084453_1.jpg]

I had no Arduino, so I used a WeMos D1 mini clone as the ISP (with changes to the ArduinoISP sketch).

For coding and uploading I use the Arduino IDE. The sketch is C code; for the delay I used ASM:

Spoiler

// Define pins
#define START_CYCLE_PIN  PB1             // Input pin of choice: PB1 (same as PCINT1) - Pin 6
#define CYCLE_ACTIVE_PIN PB4             // Output pin of choice: PB4 - Pin 3

// Define pin states
#define ACTIVE HIGH

// Define helper: expand a macro and turn its value into a string literal
#define STR_HELPER(x) #x
#define STR(x) STR_HELPER(x)

void setup() {
  OSCCAL = 181;
//  TCCR1 = 1<<CTC1 | 0<<COM1A0 | 1<<CS10; // CTC mode, /1
//  GTCCR = 1<<COM1B0;                     // Toggle OC1B
  PLLCSR = 0<<PCKE;                      // System clock as clock source
//  OCR1C = 0;

  cli();                                 // Disable interrupts
  pinMode(START_CYCLE_PIN, INPUT_PULLUP);// Set our input with a pullup to keep it stable
  pinMode(CYCLE_ACTIVE_PIN, OUTPUT);
}

void loop() {
  if(digitalRead(START_CYCLE_PIN) == ACTIVE ){
    //digitalWrite(CYCLE_ACTIVE_PIN, HIGH);
    asm volatile ("sbi 0x18," STR(CYCLE_ACTIVE_PIN) ); // Port pin PB4 high, 2 clock cycles ~= 100ns

    asm volatile ("nop");
    asm volatile ("nop");
    asm volatile ("nop");
    asm volatile ("nop"); //  4 cycles ~= 200ns

    asm volatile ("nop");
    asm volatile ("nop");
    asm volatile ("nop");
    asm volatile ("nop"); //  8 cycles ~= 400ns

    asm volatile ("nop");
    asm volatile ("nop");
    asm volatile ("nop");
    asm volatile ("nop"); // 12 cycles ~= 600ns

    //digitalWrite(CYCLE_ACTIVE_PIN, LOW);
    asm volatile ("cbi 0x18," STR(CYCLE_ACTIVE_PIN)); // Port pin PB4 low, 2 clock cycles ~= 100ns

    while(digitalRead(START_CYCLE_PIN) == ACTIVE);    // Wait for the start pin to go LOW again!
  }
}

 

The "digitalWrite" call has already been replaced with ASM. The "digitalRead" still has to be replaced with ASM. Also, I am not sure whether "OSCCAL = 181; PLLCSR = 0<<PCKE;" is enough to set the CPU clock to ~20 MHz.
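As a rough cross-check of the timing comments in the sketch, cycle counts can be converted to nanoseconds. This is only a sketch: it assumes the core really runs at 20 MHz, which is exactly what is still uncertain here, and the helper name `cycles_to_ns` is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a number of CPU clock cycles to nanoseconds for a given clock.
   Hypothetical helper; 64-bit math avoids overflow for large cycle counts. */
uint32_t cycles_to_ns(uint32_t cycles, uint32_t f_cpu_hz)
{
    assert(f_cpu_hz != 0);   /* a zero clock makes no sense and would divide by zero */
    return (uint32_t)(((uint64_t)cycles * 1000000000ULL) / f_cpu_hz);
}
```

At 20 MHz one cycle is 50 ns, so the 2-cycle sbi is 100 ns and the twelve nops are 600 ns, which matches the comments. If the core were actually running at 8 MHz, the same twelve nops would take 1500 ns.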

[Attached image: image.png]

 

Testing on the flash breadboard with an LED (and a longer delay) at the 3.3 V of the WeMos worked fine. Testing on the Pi hasn't worked so far. I have to check the wiring, the voltage and my Pi test code.

 


16 hours ago, Al_Nafuur said:

on the flash breadboard with an LED (and a longer delay) at the 3.3 V of the WeMos worked fine. Testing on the Pi hasn't worked so far. I have to check the wiring, the voltage and my Pi test code.

Nice one. I presume the AVR I/Os are running at 3.3 V rather than 5 V? I had to install a second LV245 to bring the levels down on mine.
 

Are you using the timer as a substitute for the nop delays in the current Stella code? I've been scratching my head trying to find a way to allow a full 6507 cycle length while still giving Stella some time to do its processing.


21 minutes ago, MarcoJ said:

Nice one. I presume the AVR I/Os are running at 3.3 V rather than 5 V? I had to install a second LV245 to bring the levels down on mine.

The ATtiny should work at 2.7 V, but unfortunately not at 20 MHz. I thought 3.3 V would be enough, but it looks like it needs 4.5 V above 10 MHz:

[Attached screenshot: Bildschirmfoto2023-10-11um02_59_26.png]
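To get a feel for what 3.3 V might allow, the two corner points mentioned here can be joined with a straight line. This is a sketch under assumptions: 10 MHz permitted from 2.7 V, 20 MHz from 4.5 V, and a linear limit in between. The exact curve has to come from the datasheet of the specific part, and `max_clock_khz` is a made-up helper name.

```c
#include <stdint.h>

/* Estimated maximum safe clock (kHz) for a supply voltage (mV).
   Assumed corner points: 10 MHz from 2.7 V, 20 MHz from 4.5 V, linear in between.
   Returns 0 below 2.7 V, where this sketch makes no statement. */
uint32_t max_clock_khz(uint32_t vcc_mv)
{
    if (vcc_mv < 2700)
        return 0;
    if (vcc_mv >= 4500)
        return 20000;
    /* 10 MHz gained over a span of 1.8 V */
    return 10000 + ((vcc_mv - 2700) * 10000) / 1800;
}
```

Under these assumptions a 3.3 V supply allows roughly 13.3 MHz, well short of the 20 MHz the nop timing was calculated for.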

 


39 minutes ago, Al_Nafuur said:

but it looks like it needs 4.5 V above 10 MHz

Atmel, what an annoying little detail.

 

I've been thinking: if GPIOs are used to set and reset the timer, a software process could potentially do the same and supersede the hardware timer (software processes communicating with each other via GPIO).


11 hours ago, MarcoJ said:

Atmel, what an annoying little detail.

Yes.

 

11 hours ago, MarcoJ said:

I've been thinking: if GPIOs are used to set and reset the timer, a software process could potentially do the same and supersede the hardware timer (software processes communicating with each other via GPIO).

🤔

That sounds like an interesting solution. We might add one more CPU to isolcpus for this process.


On 10/10/2023 at 4:06 AM, Thomas Jentzsch said:

That happens in M6502::execute(). The peeks are in M6502.ins, which is included in the M6502 class.

Ah, there it is, the heart of it.

        // Fetch instruction at the program counter
        IR = peek(PC++, DISASM_CODE);  // This address represents a code section

        // Call code to execute the instruction
        switch(IR)
        {
          // 6502 instruction emulation is generated by an M4 macro file
          #include "M6502.ins"

          default:
            FatalEmulationError::raise("invalid instruction");
        }

 

IR = peek(...) fetches the next instruction. The M6502.ins file supplies the cases of a comprehensive switch statement that define how each instruction peeks or pokes and what it does with its operands. What is interesting is that each instruction and its operands are executed as a cluster: the peeks or pokes for the operands are all done in one go before the switch is left. To some degree this could give control over the timing between the individual bus operations. The trick is still to find some breathing space for Stella to process the emulation with the read/write delays in place.
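The "cluster" behaviour can be illustrated with a toy fetch-and-execute step. This is not Stella's code, just a minimal sketch: the `Bus`, `Cpu` and `step` names are invented, and only two real 6502 opcodes (LDA immediate, 0xA9, and STA absolute, 0x8D) are handled.

```c
#include <stdint.h>

/* Toy bus: 64 KiB of memory plus a counter of bus accesses.
   All names here are illustrative, not Stella's. */
typedef struct {
    uint8_t  mem[0x10000];
    unsigned accesses;      /* every peek/poke would be one cycle on the cartridge port */
} Bus;

typedef struct {
    uint16_t pc;
    uint8_t  a;
} Cpu;

static uint8_t peek(Bus *bus, uint16_t addr)
{
    bus->accesses++;
    return bus->mem[addr];
}

static void poke(Bus *bus, uint16_t addr, uint8_t value)
{
    bus->accesses++;
    bus->mem[addr] = value;
}

/* Execute exactly one instruction: fetch the opcode, then perform all
   operand peeks/pokes of that instruction before returning.
   Returns 0 on success, -1 on an opcode this sketch does not know. */
int step(Cpu *cpu, Bus *bus)
{
    uint8_t ir = peek(bus, cpu->pc++);
    switch (ir) {
    case 0xA9:                                  /* LDA #imm */
        cpu->a = peek(bus, cpu->pc++);
        return 0;
    case 0x8D: {                                /* STA abs */
        uint16_t lo = peek(bus, cpu->pc++);
        uint16_t hi = peek(bus, cpu->pc++);
        poke(bus, (uint16_t)(lo | (hi << 8)), cpu->a);
        return 0;
    }
    default:
        return -1;
    }
}
```

In this sketch LDA #imm produces two bus accesses and STA abs four, all inside a single call to `step`. That is the place where a per-access delay would have to be inserted.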

 


On 10/11/2023 at 3:07 PM, Al_Nafuur said:
On 10/11/2023 at 3:43 AM, MarcoJ said:

I've been thinking: if GPIOs are used to set and reset the timer, a software process could potentially do the same and supersede the hardware timer (software processes communicating with each other via GPIO).

That sounds like an interesting solution. We might add one more CPU to isolcpus for this process.

 

I did some tests on this, and it is possible to use a GPIO to communicate between two processes.

 

First I added "isolcpus=domain,managed_irq,2,3" to "/boot/cmdline.txt" to isolate CPUs 2 and 3.

Then, in the terminal:

echo performance | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us
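Isolating cores only helps if each process actually pins itself to one of them and checks that the call succeeded. A minimal sketch in C, assuming Linux with glibc; the helper names `pin_to_cpu` and `pinned_only_to` are made up for illustration.

```c
#define _GNU_SOURCE            /* needed for sched_setaffinity() and the CPU_* macros in C */
#include <sched.h>

/* Pin the calling thread to a single CPU.
   Returns 0 on success, -1 on error (errno is set by the kernel call). */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return -1;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Return 1 if the calling thread may run on `cpu` and on no other CPU, else 0. */
int pinned_only_to(int cpu)
{
    cpu_set_t set;

    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return 0;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return 0;
    return CPU_COUNT(&set) == 1 && CPU_ISSET(cpu, &set);
}
```

A cycle manager would call `pin_to_cpu(2)` and refuse to start if it returns -1, rather than only printing the result.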

 

CycleManager.c (compile: "g++ -Wall -O2 CycleManager.c -o CycleManager"):

Spoiler

// Internal cycle length manager via GPIO 22
//

#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // needed for sched_setaffinity() and the CPU_* macros when built as C
#endif

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <stdint.h>
#include <sched.h>

#define BCM2708_PERI_BASE        0x3F000000
#define GPIO_BASE                (BCM2708_PERI_BASE + 0x200000) // GPIO controller


#define PAGE_SIZE (4*1024)
#define BLOCK_SIZE (4*1024)

int  mem_fd;
void *gpio_map;

// I/O access
volatile unsigned *gpio;

cpu_set_t mask;


// GPIO setup macros. Always use INP_GPIO(x) before using OUT_GPIO(x) or SET_GPIO_ALT(x,y)
#define INP_GPIO(g) *(gpio+((g)/10)) &= ~(7<<(((g)%10)*3))
#define OUT_GPIO(g) *(gpio+((g)/10)) |=  (1<<(((g)%10)*3))
#define SET_GPIO_ALT(g,a) *(gpio+(((g)/10))) |= (((a)<=3?(a)+4:(a)==4?3:2)<<(((g)%10)*3))

#define GPIO_SET *(gpio+7)  // sets   bits which are 1 ignores bits which are 0
#define GPIO_CLR *(gpio+10) // clears bits which are 1 ignores bits which are 0

#define GET_GPIO(g) (*(gpio+13)&(1<<(g))) // 0 if LOW, (1<<g) if HIGH

#define GET_DATA_BUS() ((*(gpio+13)&0x1fe000)>>13) // GPIO 13 - 20 ( 0b0001 1111 1110 0000 0000 0000 )

#define GET_CYCLE_ACTIVE() (*(gpio+13)&(1<<22)) // GPIO 22 ( 0b0100 0000 0000 0000 0000 0000 )

#define GPIO_PULL *(gpio+37) // Pull up/pull down
#define GPIO_PULLCLK0 *(gpio+38) // Pull up/pull down clock

void setup_io();

int main(int argc, char **argv) {

  uint32_t g;
  
  printf("Starting Cycle Manager\n");

 

  // set CPU we want to run on
  CPU_ZERO(&mask);
  CPU_SET(2, &mask);
  int result = sched_setaffinity(0, sizeof(mask), &mask);
  
  printf("result %d \n", result );
  

  // Set up GPIO pointer for direct register access
  setup_io();

  // Set GPIO pin 22 to output (Cycle active pin)
  INP_GPIO(22); // must use INP_GPIO before we can use OUT_GPIO
  OUT_GPIO(22);
  
  while(1){
    if((*(gpio+13)&(1<<22)) ){
//  printf("Cycle Start!\n");
      g = 300;
      while(--g){
        asm volatile("nop");
      }
      GPIO_CLR = 1<<22;
//  printf("Cycle End!\n");
    }
  }
  return 0;
}


//
// Set up a memory region to access GPIO
//
void setup_io()
{
   // open /dev/mem
   if ((mem_fd = open("/dev/mem", O_RDWR|O_SYNC) ) < 0) {
      printf("can't open /dev/mem \n");
      exit(-1);
   }

   // mmap GPIO
   gpio_map = mmap(
      NULL,             //Any address in our space will do
      BLOCK_SIZE,       //Map length
      PROT_READ|PROT_WRITE,// Enable reading & writing to mapped memory
      MAP_SHARED,       //Shared with other processes
      mem_fd,           //File to map
      GPIO_BASE         //Offset to GPIO peripheral
   );

   close(mem_fd); //No need to keep mem_fd open after mmap

   if (gpio_map == MAP_FAILED) {
      perror("mmap");   // report why the mapping failed (errno is set)
      exit(-1);
   }

   // Always use volatile pointer!
   gpio = (volatile unsigned *)gpio_map;


} // setup_io
 

start with "sudo nice -n -20 ./CycleManager"

 

Here is the test routine (same compiler flags and start as CycleManager):

Spoiler

// Test for internal cycle manager.
//

#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // needed for sched_setaffinity() and the CPU_* macros when built as C
#endif

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <stdint.h>
#include <sched.h>

#define BCM2708_PERI_BASE        0x3F000000
#define GPIO_BASE                (BCM2708_PERI_BASE + 0x200000) // GPIO controller


#define PAGE_SIZE (4*1024)
#define BLOCK_SIZE (4*1024)

int  mem_fd;
void *gpio_map;

// I/O access
volatile unsigned *gpio;

cpu_set_t mask;


// GPIO setup macros. Always use INP_GPIO(x) before using OUT_GPIO(x) or SET_GPIO_ALT(x,y)
#define INP_GPIO(g) *(gpio+((g)/10)) &= ~(7<<(((g)%10)*3))
#define OUT_GPIO(g) *(gpio+((g)/10)) |=  (1<<(((g)%10)*3))
#define SET_GPIO_ALT(g,a) *(gpio+(((g)/10))) |= (((a)<=3?(a)+4:(a)==4?3:2)<<(((g)%10)*3))

#define GPIO_SET *(gpio+7)  // sets   bits which are 1 ignores bits which are 0
#define GPIO_CLR *(gpio+10) // clears bits which are 1 ignores bits which are 0

#define GET_GPIO(g) (*(gpio+13)&(1<<(g))) // 0 if LOW, (1<<g) if HIGH

#define GET_DATA_BUS() ((*(gpio+13)&0x1fe000)>>13) // GPIO 13 - 20 ( 0b0001 1111 1110 0000 0000 0000 )

#define GET_CYCLE_ACTIVE() (*(gpio+13)&(1<<22)) // GPIO 22 ( 0b0100 0000 0000 0000 0000 0000 )

#define GPIO_PULL *(gpio+37) // Pull up/pull down
#define GPIO_PULLCLK0 *(gpio+38) // Pull up/pull down clock

void setup_io();

int main(int argc, char **argv) {
//  uint32_t t0, t1;
  char  pp[1];
  
  uint32_t test;

  int i,g,rep, delay_counter = 700;
  uint8_t a[0x1000], b;

  printf("Starting CartTester\n");
  // set CPU we want to run on
  CPU_ZERO(&mask);
  CPU_SET(3, &mask);
  int result = sched_setaffinity(0, sizeof(mask), &mask);
  
  printf("result %d \n", result );
  

  // Set up GPIO pointer for direct register access
  setup_io();

  // Set GPIO pins 0-12 (6502 address) and 21-23 to output, leave 13-20 (data bus) as inputs
  for (g=0; g<=23; g++){
    INP_GPIO(g); // must use INP_GPIO before we can use OUT_GPIO
    if(g < 13 || g > 20)
      OUT_GPIO(g);
  }


  // Set GPIO pin 21 & 22 to output (Level shifter dir, Cycle timer)
  *(gpio+(2)) = 0b001001000;

  // Set GPIO pin 21 to Low (ls dir read)
  GPIO_CLR = 1<<21;

  // Set GPIO pin 22 to Low (cycle timer off)
  GPIO_CLR = 1<<22;

  // Wait cycles length test
  g=100;
  while(g--){
    
  //  printf("1 Active PIN state 1 %d \n", GET_CYCLE_ACTIVE() );
  //  i = 10;
  //  while(i--){asm volatile("nop"); }
  //  printf("Cycle Start!\n");
    GPIO_SET = 1<<22;
    i = 50;
    while(i--){asm volatile("nop"); }

  //  printf("2 Active PIN state %d \n", GET_CYCLE_ACTIVE() );
    i = 0;

    test = (*(gpio+13)&(1<<22));

    while( test ){
      i++;
      test = (*(gpio+13)&(1<<22));
    }
    if(i == 0){
      printf("0 cycles wait?! %d \n", g );
    }

    i = 50;
    while(i--){asm volatile("nop"); }
  //  printf("Cycle End!\n");
  //  printf("3 Active PIN state %d \n\n", GET_CYCLE_ACTIVE() );
  }  

start_ROM_read_test:
  for (rep=0x000; rep<0x1000; rep++){
    GPIO_CLR = 0b1111111111111;
    GPIO_SET = rep | 0x1000;
    i = 30000;
    while(i--){asm volatile("nop"); }

    a[rep] = (uint8_t)GET_DATA_BUS();

    if(rep < 0x10){
      printf("Read %d on address: %d \n", a[rep], rep );
    }
  }
  scanf("%*s");   // wait for any input before starting the timed test (nothing is stored)


  for (g=0; g<100; g++){
//  while(1) {
    for (rep=0x000; rep<0x1000; rep++){
      GPIO_CLR = 0b1111111111111;
      GPIO_SET = rep | ( 0x1000 + (1<<22));
      i = 3;
      while(i--){asm volatile("nop"); }

      while(GET_CYCLE_ACTIVE() );

      b = (uint8_t)GET_DATA_BUS();
      if ( b != a[rep]){
        printf("Readings differ with delay_counter %d in round %d on address: %d  a: %d b: %d !\n", delay_counter, g, rep, a[rep], b );
        delay_counter += 10;
        goto start_ROM_read_test;
//          return 0;
      }
    }
    printf(".");
  }
  printf("\nAll the Same!! delay_counter %d \n", delay_counter);


    return 0;
}


//
// Set up a memory region to access GPIO
//
void setup_io()
{
   // open /dev/mem
   if ((mem_fd = open("/dev/mem", O_RDWR|O_SYNC) ) < 0) {
      printf("can't open /dev/mem \n");
      exit(-1);
   }

   // mmap GPIO
   gpio_map = mmap(
      NULL,             //Any address in our space will do
      BLOCK_SIZE,       //Map length
      PROT_READ|PROT_WRITE,// Enable reading & writing to mapped memory
      MAP_SHARED,       //Shared with other processes
      mem_fd,           //File to map
      GPIO_BASE         //Offset to GPIO peripheral
   );

   close(mem_fd); //No need to keep mem_fd open after mmap

   if (gpio_map == MAP_FAILED) {
      perror("mmap");   // report why the mapping failed (errno is set)
      exit(-1);
   }

   // Always use volatile pointer!
   gpio = (volatile unsigned *)gpio_map;


} // setup_io
 

 

But no matter what delay I choose in CycleManager, the ROM reading fails, mostly after 10 bytes or fewer.
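For reference, the pin layout both programs rely on (address on GPIO 0-12, data on GPIO 13-20, cycle flag on GPIO 22) can be written as pure functions that are easy to check without hardware. This is only a sketch; the macro and function names are invented and are not part of the programs above.

```c
#include <stdint.h>

#define ADDR_MASK   0x1FFFu               /* GPIO 0-12: 6507 address lines A0-A12 */
#define DATA_SHIFT  13                    /* GPIO 13-20: data bus D0-D7 */
#define DATA_MASK   (0xFFu << DATA_SHIFT) /* equals 0x1FE000 */
#define CYCLE_BIT   (1u << 22)            /* GPIO 22: cycle-active flag */

/* Bits to write to the GPIO "set" register to present a 13-bit address and
   raise the cycle-active flag (the address pins must be cleared beforehand). */
uint32_t addr_set_bits(uint16_t addr)
{
    return ((uint32_t)addr & ADDR_MASK) | CYCLE_BIT;
}

/* Extract the data byte from a raw GPIO level register value. */
uint8_t data_from_levels(uint32_t levels)
{
    return (uint8_t)((levels & DATA_MASK) >> DATA_SHIFT);
}

/* Return 1 while the cycle-active flag is still set in the level register. */
int cycle_active(uint32_t levels)
{
    return (levels & CYCLE_BIT) != 0;
}
```

With these, `addr_set_bits(0x1000 | rep)` gives the same value the test program writes to GPIO_SET in its timed loop, and `data_from_levels` does the same job as GET_DATA_BUS().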

 


6 hours ago, Al_Nafuur said:

I did some tests on this, and it is possible to use a GPIO to communicate between two processes.

Wow! You have been busy. Code and all.

6 hours ago, Al_Nafuur said:

But no matter what delay I choose in CycleManager, the ROM reading fails, mostly after 10 bytes or fewer.

I wonder if there's a contention problem, where two threads trying to read or write the GPIO at the same time get things bent and twisted.


20 hours ago, MarcoJ said:

I wonder if there's a contention problem, where two threads trying to read or write the GPIO at the same time get things bent and twisted.

They both have read/write access to the GPIO pin, but they shouldn't be doing the same kind of access at the same time. While one is writing, the other is only reading, but maybe this is "disturbance" enough?

 

But it occurred to me that, instead of a GPIO pin, the two processes could use shared memory. I tried some mmap and shmget examples, but I'll have to do further testing tomorrow.


2 hours ago, Al_Nafuur said:

But it occurred to me that, instead of a GPIO pin, the two processes could use shared memory. I tried some mmap and shmget examples, but I'll have to do further testing tomorrow.

That would be the better solution. If it's possible, that would be good. Hopefully it's not too stifling to navigate protected memory.


19 minutes ago, DirtyHairy said:

Why two processes? Two threads are sufficient, and they share memory.

Yes, with the shared memory I also switch to threads.

 

20 minutes ago, DirtyHairy said:

Use std::atomic to increment an atomic counter.

Maybe we don't need an atomic counter; an (atomic) flag in shared memory should be sufficient. 🤔
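A minimal sketch of the flag idea with two threads, written in C11 (`stdatomic.h`) rather than C++ `std::atomic`. Everything here is an assumption for illustration: the `CycleTimer` struct, the function names, and the crude loop that stands in for a real delay. The `sched_yield()` calls only keep the sketch friendly on a shared machine; on an isolated core one would spin without them.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    atomic_bool cycle_active;  /* set by the emulation thread, cleared by the timer thread */
    atomic_bool stop;          /* asks the timer thread to exit */
    uint32_t    delay_loops;   /* crude delay length, stands in for a real timer */
    uint32_t    cycles_done;   /* written only by the timer thread */
} CycleTimer;

/* Timer thread: wait for the flag, hold it for the delay, then clear it. */
void *cycle_timer_thread(void *arg)
{
    CycleTimer *t = arg;

    while (!atomic_load_explicit(&t->stop, memory_order_acquire)) {
        if (atomic_load_explicit(&t->cycle_active, memory_order_acquire)) {
            for (volatile uint32_t i = 0; i < t->delay_loops; i++) { /* delay */ }
            t->cycles_done++;
            atomic_store_explicit(&t->cycle_active, false, memory_order_release);
        } else {
            sched_yield();
        }
    }
    return NULL;
}

/* Emulation side: mark the start of a bus cycle. */
void cycle_start(CycleTimer *t)
{
    atomic_store_explicit(&t->cycle_active, true, memory_order_release);
}

/* Emulation side: block until the timer thread has ended the cycle. */
void cycle_wait(CycleTimer *t)
{
    while (atomic_load_explicit(&t->cycle_active, memory_order_acquire))
        sched_yield();
}
```

The emulation thread would call `cycle_start()` when it presents an address, do its own processing, and call `cycle_wait()` before sampling the data bus, which is the same handshake the GPIO 22 experiment performs.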


19 minutes ago, Al_Nafuur said:

Yes, with the shared memory I also switch to threads.

Does this mean the timer code needs to be embedded into the Stella build? I imagine it would only need to exist in the RTStella build, so perhaps some #ifdef logic is needed. I also think that a simple flag system, perhaps a single uint8, could make inter-thread communication possible.

 

 


This is a bit of a milestone post. My Pi 4 rig has had Dig Dug running in demo mode for 2 days now without glitches. Before that, Super Football had been running for 3 days. The RTStella software is running with the CartridgePort GPIO driver and the nop delay system, combined with the performance CPU governor and the Linux scheduler off. This configuration has allowed many original cartridges, with and without RAM, to run at real-time speed without glitches in timing or execution.

 

[Attached image: IMG_6923.JPG]

 

The nop delay system takes advantage of the fact that original cartridges in many cases do not need the full 6507 read/write cycle to operate stably. A fixed delay is enforced between reads and writes to simulate a shortened 6507 timing cycle. Reads and writes on the cartridge bus are much snappier than on a normal Atari 2600, which leaves plenty of time for Stella to catch up on emulating the game. It is a kind of "overclocking", pushing the cartridge hardware to strobe faster than normal. Once Stella has done its emulation catch-up, its own internal timers govern the real-time speed so that it doesn't emulate too fast. This first milestone works great for many original Atari 2600 cartridges.
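The time this frees up can be put into numbers. The sketch below assumes the NTSC console clock of about 1.19 MHz (the 3.579545 MHz colour clock divided by 3). The function names and the 300 ns figure in the note afterwards are illustrative only, not measurements from this project.

```c
#include <stdint.h>

#define NTSC_CPU_HZ 1193182u   /* 3.579545 MHz colour clock / 3 */

/* Length of one 6507 cycle in nanoseconds, rounded down. */
uint32_t cpu_cycle_ns(void)
{
    return 1000000000u / NTSC_CPU_HZ;
}

/* Time left per 6507 cycle for emulation work if the shortened bus
   access occupies bus_ns. Returns 0 if the access uses the whole cycle. */
uint32_t emulation_budget_ns(uint32_t bus_ns)
{
    uint32_t cycle = cpu_cycle_ns();
    return bus_ns >= cycle ? 0 : cycle - bus_ns;
}
```

A full cycle is about 838 ns, so a hypothetical 300 ns bus access would leave roughly 538 ns per cycle for Stella, while an access that takes the full cycle leaves nothing.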

 

The downside of this approach is that games which do rely on a full 6507 cycle will not work stably. These are games that use co-processors such as ARM chips. In such architectures the co-processor runs its code in bursts and needs to complete all of the required processing before the next 6507 cycle. In the nop delay system with its shortened cycle length, RTStella asks for the next data too early, before the co-processor has finished the work it would normally fit into a full cycle. This typically results in the bus cycle being missed, and the game crashes. The same has proven true for flash carts emulating original game carts, which need more of the 6507 bus cycle than the simpler cartridges built from industry-standard ROM and RAM chips. For flash carts, longer nop delays have been required for stability, which has often prevented Stella from running in real time because too little time is left to process the emulation. This has led to slowdown of some games on flash carts.

 

The next milestone being worked towards is the use of an accurate timer to enforce full-length 6507 cycles. The hope is to support more, or ideally all, cartridges that work on the real console. The challenge of this approach is that, apart from performing the low-level strobing of the cartridge, Stella is also trying to emulate the game in real time. In the past, enforcing long delays made Stella run slow. Thankfully, the Raspberry Pi has multiple CPU cores. The hope is that, with accurate timing and some parallel tasking, the end goal of full 6507 cycle length and real-time emulation can be achieved.
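One way such a timer could look in software is deadline-based pacing against the monotonic clock. This is a sketch of the general technique, not the approach chosen in this project; the function names are invented, and it assumes `clock_gettime(CLOCK_MONOTONIC)` is precise enough on the target.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Current monotonic time in nanoseconds. */
uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Busy-wait until the monotonic clock reaches deadline_ns. */
void spin_until_ns(uint64_t deadline_ns)
{
    while (now_ns() < deadline_ns) { /* spin */ }
}

/* Run `cycles` bus cycles, each padded to `cycle_ns`.
   Deadlines advance from the start time, so a late cycle does not
   push all following cycles back. Returns the elapsed time in ns. */
uint64_t run_paced_cycles(uint32_t cycles, uint32_t cycle_ns, void (*bus_access)(void))
{
    uint64_t start = now_ns();
    uint64_t deadline = start;

    for (uint32_t i = 0; i < cycles; i++) {
        if (bus_access)
            bus_access();            /* the actual cartridge read or write */
        deadline += cycle_ns;
        spin_until_ns(deadline);     /* pad the cycle to its full length */
    }
    return now_ns() - start;
}
```

Run on a dedicated core, the spinning would not take time away from the emulation thread, which is the parallel tasking idea described above.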

 

Until then, hooray for Milestone 1 of the CartridgePort / RTStella project (nop delay method), where many original cartridges can be emulated in real time and flash cartridges work at or close to real time.

 

 

 

 

 

RTStella Raspberry Pi Interface Milestone 1 v0.2.pdf

Edited by MarcoJ
Milestone 1 Interface Schematic update

12 hours ago, Al_Nafuur said:

Yes, with the shared memory I also switch to threads.

 

Maybe we don't need an atomic counter; an (atomic) flag in shared memory should be sufficient. 🤔

I have pushed my WIP code for the threaded timer. It is on the branch "feature/cartridgeportThread" and currently works only if Stella is started with "-debug" to the debugger window. Switching to the emulation window freezes the Stella window.


39 minutes ago, Al_Nafuur said:

I have pushed my WIP code for the threaded timer. It is on the branch "feature/cartridgeportThread" and currently works only if Stella is started with "-debug" to the debugger window. Switching to the emulation window freezes the Stella window.

Just pushed a fix which changed the CPU for the cycle timer thread. Now the emulation works, but again I have trouble getting a stable emulation on my Pi 3.

 
