
Back to the emulator...


robus


I'd taken a break from my Atari 400/800 emulator effort after getting very slow performance. But after talking it over with a friend at Thanksgiving I decided I needed to rework my main CPU core logic to get it running faster.

 

Well, I did that, and after converting it to a microcode-based approach (much cleaner!) I was disappointed to see my cycle time not improve much. Then on a whim I chopped out the GTIA stuff, and just by running at full speed I'm getting more than 10M cycles per second!

 

For example:
 

cycles per second: 10,907,708
cycles per second: 10,705,184
cycles per second: 11,008,288
cycles per second: 10,971,878

 

So obviously that's no longer a problem (and perhaps never was - sheesh!).

 

Anyway. Now to find out what's going wrong with my GTIA/ANTIC emulation (probably also very slow).

 

But at least I'm back to making progress!


I've been using dispatch_source_set_timer() to create a clock for the CPU, but it seems to be very heavyweight. Just turning it on, instead of running at full speed, slows the whole thing down massively:

 

cycles per second: 163,055
cycles per second: 209,956
cycles per second: 214,996
cycles per second: 216,161

 

Obviously that's not going to work. So what's a better approach to getting a fast steady tick into the CPU?


If you are trying to use a system timer to regulate the emulated CPU at the cycle level, that's going to be a massive amount of overhead for little gain, especially if it has to call into the kernel to set up the timer. Modern CPUs rely heavily on locality of code and data, and doing this greatly increases the working set per emulated cycle processed.

 

More generally, it also won't actually get you anything unless you have I/O that can take advantage of the granularity. That includes input like keyboard and controller events, and output like video and audio frames. Between those I/O points, it doesn't matter whether you update the emulated state in a batch all at once or spaced out in real time. Therefore, you're better off running the emulation as fast as possible between those points. The OS is going to be limited both by hardware and software design as to how often it can actually update I/O, since input devices are only polled so often and the output devices typically get fed every several milliseconds. A reasonable compromise is to run at least a few dozen scan lines of emulation at a time.

 

System timers also have practical limits with accuracy and precision. The underlying timer source can be as coarse as microseconds, and additionally some OSes adjust or coalesce timers in order to reduce interrupt overhead and power consumption. Drift can also be an issue depending on the precision of the timer interval and whether the timer design is biased toward interval consistency or accuracy. As such, it's a good idea to plan to add a control loop to monitor and adjust long-term timing. This can be very simple to start if you're running the emulation a frame at a time -- just keep track of the error between actual and desired time each time the timer fires, and apply small corrections to the timer interval as needed to maintain the desired average speed.
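For instance, here is a minimal sketch of one variation of that idea, using a sleep-based loop rather than an OS timer, assuming the emulation runs a frame at a time. The frame duration and the emulate_one_frame()/present_frame() hooks are placeholders, not from any particular codebase:

#include <stdint.h>
#include <time.h>

#define FRAME_NS 20055860LL          /* ~1/49.86 s for PAL; adjust for NTSC */

void emulate_one_frame(void);        /* assumed: runs one frame of emulation */
void present_frame(void);            /* assumed: pushes video/audio to the host */

static int64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

void run_emulator(void) {
    int64_t next = now_ns();
    for (;;) {
        emulate_one_frame();
        present_frame();

        next += FRAME_NS;                  /* ideal time the next frame is due */
        int64_t err = next - now_ns();     /* >0: ahead of schedule, <0: behind */
        if (err > 0) {
            struct timespec ts = { 0, (long)err };
            nanosleep(&ts, NULL);          /* sleep off the surplus */
        } else if (err < -5 * FRAME_NS) {
            next = now_ns();               /* hopelessly behind: resynchronize */
        }
    }
}

Because the target time advances by a fixed step instead of being re-measured each frame, small sleep inaccuracies cancel out over time rather than accumulating as drift.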

 

Finally, there's an inherent tradeoff between timing accuracy and output latency. The Atari runs slightly off from standard frame rates, at 59.92 fps for NTSC and 49.86 fps for PAL. If you run at the accurate frame rate, the emulated frame timing will diverge slightly from the output frame timing, requiring resampling frames and adding half a frame of latency on average. The alternative is to alter emulation timing and lock to vsync rate instead, if it is close enough. The granularity at which the emulation is run and any additional buffering for resampling will determine the minimum amount of video and audio buffering needed.

 


It's easiest to work in units of one frame, i.e. a full screen.

 

On PAL the CPU clock is 1773447 Hz. A full screen is 312 scan lines of exactly 114 cycles, so the frame rate is 1773447/(312*114) = 49.86074561 frames per second.

 

So, each full emulation pass runs for 312*114 cycles, generates one bitmap of graphics, and produces 1/49.86074561th of a second of audio data.

 

49.86 fps does not easily match the refresh rate of your current display, so the easiest approach is to act as if it were 50 Hz: upsample the audio to retain proper frequencies, but do _not_ execute more cycles per frame.
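Expressed as constants (just a sketch restating the figures above; the names are illustrative):

/* PAL timing constants, from the figures above. */
#define PAL_CPU_HZ        1773447                              /* CPU clock in Hz */
#define PAL_SCANLINES     312
#define CYCLES_PER_LINE   114
#define CYCLES_PER_FRAME  (PAL_SCANLINES * CYCLES_PER_LINE)    /* 35568 */

/* 1773447 / 35568 = 49.86074561... frames per second */
static const double pal_frame_rate = (double)PAL_CPU_HZ / (double)CYCLES_PER_FRAME;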

 

Edit: NTSC numbers are (rounded up) 1789773 Hz, one frame is 262*114 cycles, so the frame rate is 1789773/(262*114) = 59.92276

 

Edit2: here's an interesting link on instruction-ticked / cycle-stepped emulation: https://floooh.github.io/2019/12/13/cycle-stepped-6502.html

 

Edit3: fixed NTSC clock as per Phaeron's observant eye 🔍

Edited by ivop

Yep, in general emulators don't talk to the outside world in a way where timing accuracy below about a frame is important. The important thing is that what you see and hear matches what the real hardware does.

So audio - you can build a frame or two's worth and just buffer it. If there's a little audio lag it shouldn't matter much.

Video - similarly, build a screen and display it as a real machine would, post-processing for RGB triads or scanlines or whatever; so long as you see it about 1/50th of a second later, it's fine.

User input - possibly more important than the others - low latency is desirable, but keyboard and controller input from the PC should be quick and easy to process and pass through anyway.


14 minutes ago, phaeron said:

1789773 Hz (315/88 / 2)

Yeah, sorry. I cut-and-pasted it from a source I actually knew was wrong, but didn't think about that.

But shouldn't it be the crystal (3579545 Hz) divided by 2? That would make it 3579545/2 = 1789772.5

 

Edit: and the XE series is:

 

PAL: 14187570/8 = 1773446.25 Hz
NTSC: 14318180/8 = 1789772.5 Hz

Edited by ivop

So, if I'm understanding correctly, I should just run the video refresh at 50 fps and, basically in the vertical blank period, let the CPU process as much as it wants until ANTIC grabs the bus and paints the next screen?

 

(not worrying about audio for now)

Edited by robus

It seems you're still bootstrapping a lot of basic functionality, so I'd recommend structuring everything around scanline granularity for now -- interleave running ~100 cycles of CPU and having ANTIC+GTIA render a scanline on screen. Repeat 262 or 312 times per frame. There will be a lot of timing slop compared to the real hardware, but scan line granularity is enough to get a lot working including DLIs. I wouldn't worry about accurate timing until you have all of the basic functionality implemented and a fair amount of software running.
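As a rough sketch of that shape (the function names here are placeholders for your own CPU/ANTIC/GTIA code, not a real API):

void cpu_run_cycles(int n);             /* assumed CPU core entry point */
void antic_gtia_render_line(int line);  /* assumed scan-line renderer */

/* One emulated frame at scan-line granularity (PAL: 312 lines; NTSC: 262). */
void emulate_frame(void) {
    for (int line = 0; line < 312; line++) {
        cpu_run_cycles(105);            /* roughly 100 of the 114 cycles; ANTIC DMA takes the rest */
        antic_gtia_render_line(line);   /* then render that scan line into the frame buffer */
    }
    /* present the finished bitmap (and a frame's worth of audio) here */
}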

 


Not exactly. You need to interleave the CPU cycles and the ANTIC/GTIA cycles for a whole frame, and then display it. Look at the ANTIC datasheet and the Altirra Hardware Manual to see when exactly ANTIC steals cycles and what for. Things like writes to the color register (GTIA) during a scan line have to be timed 100% correctly, otherwise your display will be wrong. Where the exact write occurs depends on the cycles stolen by ANTIC: RAM refresh (9 cycles, sometimes 8), reading the display list, reading screen memory, sometimes reading character set data, reading PM data if DMA is enabled, et cetera.

 

Once you have the display code working, you also need to emulate POKEY every single cycle, with the CONSOL/speaker GTIA bit mixed in. That results in 312*114 samples per frame, which you then run through a downsampling filter to get, for example, 882 samples (44100/50) if your audio output is 44.1 kHz.
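A very crude sketch of that downsampling step (a plain box average, assuming float samples; a real resampler would want a proper low-pass filter):

/* Downsample one frame of machine-rate audio (312*114 = 35568 samples)
   to 882 output samples (44100 Hz at a nominal 50 fps). */
void downsample_frame(const float in[35568], float out[882]) {
    const double step = 35568.0 / 882.0;      /* ~40.3 input samples per output sample */
    double pos = 0.0;
    for (int i = 0; i < 882; i++) {
        int start = (int)pos;
        int end = (int)(pos + step);
        if (end > 35568) end = 35568;
        float sum = 0.0f;
        for (int j = start; j < end; j++)
            sum += in[j];
        out[i] = sum / (float)(end - start);  /* average of this bucket */
        pos += step;
    }
}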

 

Basically, your loop will be something like this:

    for each of the 312*114 cycles in a frame:
        run an ANTIC, GTIA and POKEY cycle
        check if ANTIC stole the cycle:
            if so, skip the CPU cycle
            if not, run a CPU cycle

    busy wait for VSYNC
    display the bitmap
    play a snippet of audio
    back to the top
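In C, that cycle-stepped frame loop might look roughly like this (a sketch only; the *_tick() functions are assumed names for your own chip cores, not a real library):

int antic_tick(void);   /* assumed: advances ANTIC one cycle, returns nonzero if it stole the bus */
void gtia_tick(void);   /* assumed chip-stepping hooks */
void pokey_tick(void);
void cpu_tick(void);

/* One frame, stepped one machine cycle at a time (PAL: 312*114 cycles). */
void run_frame_cycle_stepped(void) {
    for (int cycle = 0; cycle < 312 * 114; cycle++) {
        int stolen = antic_tick();     /* ANTIC decides first whether it needs the bus */
        gtia_tick();
        pokey_tick();                  /* accumulate one audio sample per machine cycle */

        if (!stolen)
            cpu_tick();                /* the 6502 only advances when the bus is free */
    }
    /* then: wait for vsync, display the bitmap, hand off the audio snippet */
}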

 

I think I got the order right, so that GTIA color changes do not occur too early, but it's late, so I might have the order reversed.

 

You could consider a pull design, like a video player, where the play loop is requesting new data through callbacks. See libSDL documentation for details. You'll get your sync for free then. The play loop requests new data each time it almost runs out of currently playing data. So instead of you pushing graphics and audio to the screen and audio device, it pulls it from your emulation code.
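For example, with SDL2's audio callback the pull side might look something like this (a sketch; Emulator and emu_render_audio() are assumed stand-ins for your own emulator state and mixing code):

#include <SDL.h>

typedef struct Emulator Emulator;                                  /* your emulator state (assumed) */
void emu_render_audio(Emulator *emu, Sint16 *out, int nsamples);   /* assumed hook into the emulation */

/* SDL calls this whenever the audio device needs more data, so the emulator
   is pulled forward just far enough to fill the request. */
static void audio_callback(void *userdata, Uint8 *stream, int len) {
    Emulator *emu = (Emulator *)userdata;
    emu_render_audio(emu, (Sint16 *)stream, len / (int)sizeof(Sint16));
}

int open_audio(Emulator *emu) {
    SDL_AudioSpec want, have;
    SDL_zero(want);
    want.freq     = 44100;
    want.format   = AUDIO_S16SYS;
    want.channels = 1;
    want.samples  = 1024;              /* roughly one 50 Hz frame's worth of samples */
    want.callback = audio_callback;
    want.userdata = emu;

    SDL_AudioDeviceID dev = SDL_OpenAudioDevice(NULL, 0, &want, &have, 0);
    if (dev == 0)
        return -1;
    SDL_PauseAudioDevice(dev, 0);      /* unpause: SDL starts pulling data */
    return 0;
}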


I've got it reworked so the CPU is now pulled from the GTIA side. Basically a frame is rendered and cycle time is given to the CPU during the VBI and HBIs.

 

Screen render time is currently 0.04 seconds, so too slow for 50 fps. It needs to be below 0.02 s, but I've got a bunch of Objective-C to replace with C functions, which should do the trick.

 

I'm not yet worrying about the issues @ivop raises above; for now I'm just trying to get the speed up to something decent. The display is doing alright - I get the Memo Pad display as expected (though my cursor is quite wacky!)

 

 


Some cartridge ROMs don't seem to be recognized and just drop back to the Memo Pad. What might be causing that? For example: Donkey Kong. (Of course, I'm not sure where I picked it up; I have found a great resource at archive.org which has a bunch of bin and atr files - which I'm assuming are just ROM dumps, as they seem to mostly be 16KB.)

 

On closer look, some of these are not 16KB, so they are not from cartridges.

Edited by robus

Actually, I'm not quite understanding the cartridge layouts. It seems like there's 16KB available for cartridges, which can be split into Cart A and Cart B on the 800. On the 400, does cart A address all 16KB? I'm trying to understand this from the Technical Reference:
 

8000-9FFF = Cartridge B, Cartridge A (half of 16K size) or RAM
A000-BFFF = Cartridge A or RAM

 

Some of these ROMs are 16KB, which I was trying to put at A000!


The left cartridge port is a superset of the right cartridge port -- it contains the decoding signals for both the left window at $A000-BFFF and the right window at $8000-9FFF as well as the cartridge control region at $D500-D5FF, while the right cartridge port only receives the latter two. Thus, a cartridge in the left port can map any region the right port can.

 

 


16 minutes ago, robus said:

OK, so when I load a 16KB ROM I should split it across cart B and cart A so that it fills the entire address space? On the 400, though, does it have cart A only, or is it actually cart B?

Map a plain 16K cartridge image straight over $8000-BFFF, first 8K at $8000 and second 8K at $A000. The 400 is no different -- it just has no right cartridge slot, but there's no difference in left carts even when mapping both cart windows.
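A concrete sketch of that mapping, assuming the emulator keeps a flat 64K memory array (the function and array names are just illustrative):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Load a plain (non-banked) 16K cartridge image over $8000-$BFFF:
   the first 8K lands at $8000, the second 8K at $A000. Returns 0 on success. */
int load_16k_cart(uint8_t memory[65536], const char *path) {
    uint8_t rom[16384];
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;
    size_t n = fread(rom, 1, sizeof rom, f);
    fclose(f);
    if (n != sizeof rom)
        return -1;                             /* not a plain 16K image */
    memcpy(&memory[0x8000], rom, sizeof rom);  /* fills $8000-$BFFF contiguously */
    return 0;
}

(An 8K-only image would instead go at $A000-$BFFF.)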

 


Now I’m wondering about WSYNC with this new pull approach.
 

Before, I used a semaphore to hold the CPU thread until the next scan line, but now that it's all single-threaded, that won't work. The CPU code needs to suspend when it writes to WSYNC and wait for the next scan line cycle opportunity. Doing that cleanly seems tricky and likely ugly?


1 hour ago, robus said:

Now I’m wondering about WSYNC with this new pull approach.
 

Before, I used a semaphore to hold the CPU thread until the next scan line, but now that it's all single-threaded, that won't work. The CPU code needs to suspend when it writes to WSYNC and wait for the next scan line cycle opportunity. Doing that cleanly seems tricky and likely ugly?

There's no need to do any tricky suspension; the CPU can just hold off executing the next instruction and skip or execute dummy cycles until it's been cleared to continue. WSYNC doesn't affect anything in the chipset besides pulling RDY on the CPU, so the rest of the emulation won't care about the particulars of this synchronization. Later on if you go for high accuracy you'll need to do something similar to model HALT on the CPU anyway.

 

Technically, the full way this works is:

  • ANTIC lets one additional cycle pass before asserting RDY. This one cycle delay occurs regardless of whether the CPU is halted or not during this cycle.
  • On the second cycle after the write to WSYNC, the CPU will stop until the start of horizontal blank if it is trying to issue a read cycle. On standard hardware this will always be an instruction fetch as the first two cycles of any instruction or interrupt sequence are always instruction fetches, and no instructions that can write to WSYNC do read cycles after write cycles (the stack can't reach it).
  • If the second cycle is a write cycle, the 6502 will execute that write cycle and then stop on the next read cycle. This can only happen with a read/modify/write instruction, e.g. INC WSYNC.
  • The 6502 cannot handle interrupts during a WSYNC wait. The next earliest opportunity to start interrupt processing is after the entirety of the WSYNC wait. This will only happen if the CPU was already about to enter interrupt processing when WSYNC was written. If the interrupt registers during the WSYNC wait, then the interrupt sequence can only start after the next instruction.
  • The 6502 is effectively retrying the read cycle repeatedly during the WSYNC wait. It is possible but very hard to observe this -- it requires having the CPU stop while reading an address that either has a side effect on reads (PORTA/PORTB), or autonomously changes and the data is observed by ANTIC reading from an open bus.

It'll be some time before you have to worry about this level of detail.
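As a sketch of the simple approach described above (just an RDY flag gating the CPU, ignoring the one-cycle delay and the read/write subtleties in the list; all names here are illustrative):

#include <stdint.h>

void cpu_execute_one_cycle(void);  /* assumed: advances the 6502 core by one cycle */

static int rdy_low = 0;            /* set when a write to WSYNC ($D40A) pulls RDY low */

/* In the ANTIC register write path. */
void antic_write(uint16_t addr, uint8_t value) {
    (void)value;                   /* the written value doesn't matter for WSYNC */
    if ((addr & 0x0F) == 0x0A)     /* $D40A = WSYNC */
        rdy_low = 1;
    /* ... other ANTIC registers ... */
}

/* Called once per machine cycle when the bus is free for the CPU. */
void cpu_tick(void) {
    if (rdy_low)
        return;                    /* CPU burns the cycle; no instruction progress */
    cpu_execute_one_cycle();
}

/* Called by ANTIC at the start of horizontal blank. */
void antic_hblank_start(void) {
    rdy_low = 0;                   /* release RDY; the CPU resumes on its next cycle */
}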

 


2 hours ago, Rybags said:

WSYNC can't inadvertently trigger the missed NMI condition like a correctly timed IRQ, right?

Not as far as I know; I tested this at various cycle offsets and didn't see a case where the NMI was dropped. NMIs get dropped when the 6502 gets confused by an overlapping IRQ and services the IRQ when it should service the NMI; this isn't a problem with WSYNC/RDY as it isn't an interrupt. That being said, WSYNC does of course royally screw up the timing of any DLIs that try to fire at the same time.

 


OK - I've got 16KB carts loading and a rudimentary WSYNC/RDY system. I thought it would clean up some of the flickery display issues I'm seeing, but those are probably issues in my display list rendering and font handling. I'll have a look at that next, but it's very nice to see some ROMs loading and at least showing something (I was getting tired of Memo Pad! :) )

 

P/M graphics are going to be interesting!

 

Thanks again for the help, I really appreciate it!

