ANTIC DMA screen memory and display list accesses


robus

I've a feeling that my ANTIC implementation is hogging the bus because of all the requests for screen memory and dlist instructions when processing characters for rendering.

 

I imagine that ANTIC grabs a row of bytes at a time so that the CPU isn't held up for long, but I don't see anything about ANTIC having any RAM buffer?

ANTIC has a buffer that holds the character values, or the bitmap data if it's in a bitmap mode.

There is no buffer for character set data (except maybe a couple of bytes' worth of shift register), so in ANTIC modes 5 and 7, for example, which have double-height characters, the character set data is actually fetched twice.

 

The DMA timing doesn't necessarily line up exactly with what you see on screen. The Altirra Hardware Manual has some good diagrams that show the DMA timing.

Also note that ANTIC generates the RAM refresh signals, usually 9 per scanline. The timing and number of these can vary depending on screen width, horizontal scrolling, and whether it's the first line of a 40-character mode.

ANTIC does actually hold up the 6502 for long periods at a time. In the highest-resolution character modes it can block the CPU for almost the entire scanline, since reading both the character names and the character data takes the full memory bandwidth. But overall it's not that bad -- a GR.0 screen is one of the heaviest standard modes and still only takes about 30% of the available non-refresh cycles for playfield DMA.
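As a rough sanity check of that ~30% figure, here's a back-of-envelope sketch (assuming NTSC timing and a normal-width 40-byte GR.0 playfield; the constants are approximations, not exact ANTIC DMA slot counts):

```c
/* Back-of-envelope estimate of GR.0 playfield DMA load.
   Assumes NTSC timing and a normal-width (40-byte) playfield;
   numbers are approximate. */
#include <stdio.h>

int main(void) {
    const int cycles_per_line  = 114;   /* machine cycles per scanline */
    const int refresh_per_line = 9;     /* RAM refresh cycles per scanline */
    const int lines_per_frame  = 262;
    const int non_refresh = (cycles_per_line - refresh_per_line) * lines_per_frame;

    const int mode_lines      = 24;                /* GR.0 character rows */
    const int scanlines       = mode_lines * 8;    /* 8 scanlines per row */
    const int name_fetches    = mode_lines * 40;   /* character names, first scanline of each row */
    const int charset_fetches = scanlines * 40;    /* glyph data, every scanline */
    const int dlist_fetches   = 32;                /* display list bytes, roughly */

    const int playfield_dma = name_fetches + charset_fetches + dlist_fetches;

    printf("%d of %d non-refresh cycles (~%d%%)\n",
           playfield_dma, non_refresh, 100 * playfield_dma / non_refresh);
    /* prints roughly: 8672 of 27510 non-refresh cycles (~31%) */
    return 0;
}
```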

 

Also, even when reading into the internal buffer, ANTIC never does burst reads -- all DMA cycles are spaced out and interleaved according to when they're used.

 

OK, good to know. I'm seeing the Memo Pad text coming out as if it's being typed, rather than appearing instantly (this is after I reworked my CPU emulator to be more accurate about the number of cycles each instruction should take).

 

I believe the Memo Pad text should appear instantly? I guess it's time for me to install another emulator for comparison purposes.

Memo Pad goes through E:, so there's all that overhead between the key interrupt and the character actually appearing.

If you're doing low-level emulation then you don't really need to worry about that sort of thing, since each little component acts independently and the graphics emulation only has to do the memory fetches for the DList, graphics data, character set, etc.

A global check you can do is to compute a histogram of program counter addresses and periodically dump out the top addresses to see where the emulated 6502 is spending most of its time. When no keys are pressed, it should be spending the majority of the time in the keyboard (K:) wait loop with only a small fraction of time in the vertical blank interrupt code. The profile should show whether the 6502 is executing around the expected number of instructions per frame or if it is spending an abnormal amount of time in a specific OS routine.
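If it helps, here's a minimal sketch of that PC histogram (plain C, with hypothetical names; hook profile_step() into the instruction fetch and call profile_dump() once a second or so):

```c
/* Sketch of a program-counter histogram for a 6502 core: one counter per
   possible PC value, bumped on every instruction fetch and dumped
   periodically to show where the emulated CPU spends its time. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint32_t pc_hits[0x10000];

static inline void profile_step(uint16_t pc) {
    pc_hits[pc]++;
}

/* Print the top_n hottest addresses, then reset for the next interval. */
static void profile_dump(int top_n) {
    for (int n = 0; n < top_n; ++n) {
        uint32_t best = 0;
        int best_pc = -1;
        for (int pc = 0; pc < 0x10000; ++pc) {
            if (pc_hits[pc] > best) {
                best = pc_hits[pc];
                best_pc = pc;
            }
        }
        if (best_pc < 0)
            break;
        printf("$%04X: %u hits\n", (unsigned)best_pc, (unsigned)best);
        pc_hits[best_pc] = 0;   /* exclude it from the next pass */
    }
    memset(pc_hits, 0, sizeof pc_hits);
}
```

With no keys pressed, the hottest addresses should land in the OS keyboard wait loop, with a smaller cluster in the VBI handler.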

 

One possibility for where things may be going wrong is the key click handler -- if STA WSYNC is waiting too long, for instance. Keys should be processed in less than a frame end-to-end, as long as the screen isn't being scrolled.

 

3 weeks later...

I've done a bit of profiling and the emulator is running quite slowly -- about 10% of real speed (150k cycles per second instead of 1.7M!). The code is pretty simple, so I think it's the overhead of objc_msgSend (though that surprises me quite a bit given the performance of the hosting computer, a MacBook M1 Pro).

 

 

4 hours ago, robus said:

I've done a bit of profiling and the emulator is running quite slowly -- about 10% of real speed (150k cycles per second instead of 1.7M!). The code is pretty simple, so I think it's the overhead of objc_msgSend (though that surprises me quite a bit given the performance of the hosting computer, a MacBook M1 Pro).

The M1 is a very efficient architecture, but it is subject to the same fundamental rules as any other CPU. Let's say it's running at its max clock speed of 3.2GHz. With the Atari system clock of 1.79MHz, that gives only about 1800 host cycles per Atari cycle. The core emulation loop is also typically single-threaded and full of branchy code, so the M1's core count isn't going to help and its wide execution units aren't going to be as effective. This means you absolutely must minimize the amount of work done on a per-Atari-cycle basis.

In my emulator, the only parts that run per-cycle are the 6502 and the ANTIC bus logic; everything else is batched and runs on a global event queue. This is most important for POKEY, which will be very slow to emulate if you are ticking all four timers every cycle. The memory subsystem also needs to be fast so a large if() tree isn't getting run every time the 6502 does an instruction fetch or ANTIC reads another playfield byte.
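For the batched/event-queue part, here's one minimal way it could look (plain C, hypothetical names; a sketch of the general idea, not how any particular emulator structures it):

```c
/* Sketch of a global event queue: instead of ticking every chip each Atari
   cycle, each subsystem schedules the machine cycle of its next interesting
   event (e.g. a POKEY timer underflow) and catches up only when it fires. */
#include <assert.h>
#include <stdint.h>

typedef void (*event_fn)(void *ctx);

typedef struct {
    uint64_t when;   /* absolute machine cycle at which to fire */
    event_fn fire;
    void    *ctx;
} event_t;

#define MAX_EVENTS 16
static event_t queue[MAX_EVENTS];
static int     queue_len;

/* Insert an event, keeping the small array sorted by 'when'. */
static void schedule_event(uint64_t when, event_fn fire, void *ctx) {
    assert(queue_len < MAX_EVENTS);
    int i = queue_len++;
    while (i > 0 && queue[i - 1].when > when) {
        queue[i] = queue[i - 1];
        --i;
    }
    queue[i] = (event_t){ when, fire, ctx };
}

/* Called from the per-cycle loop: normally just one compare against the
   earliest deadline, with the expensive work deferred to the callback. */
static void run_events(uint64_t now) {
    while (queue_len > 0 && queue[0].when <= now) {
        event_t ev = queue[0];
        --queue_len;
        for (int i = 0; i < queue_len; ++i)
            queue[i] = queue[i + 1];
        ev.fire(ev.ctx);   /* e.g. recompute a POKEY timer and raise its IRQ */
    }
}
```

That way a timer that only underflows every few thousand cycles costs a single compare per cycle rather than four counter decrements.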

 

It's been a while since I profiled Objective-C code, but IIRC method calls are dynamically dispatched, which means high overhead and an inability to inline the target call. Instruments should pretty quickly show what is going on. Despite the effectiveness of modern pipelining and branch prediction, you don't want a large number of dynamically dispatched calls in your critical emulation loop.

 

3 hours ago, phaeron said:

The memory subsystem also needs to be fast so a large if() tree isn't getting run every time the 6502 does an instruction fetch or ANTIC reads another playfield byte.

Yeah, that was my first optimization: a 256-page map to the various modules on the address bus, rather than searching for them each time, but that only gave me about a 10k cycles/sec speed-up. I am running GTIA and ANTIC in a separate thread from the CPU, but you're right that I've got too much overhead in the current implementation. And the ANTIC implementation is definitely not optimal :)
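For reference, a minimal sketch of that kind of 256-page dispatch table (plain C, hypothetical names). Letting plain RAM/ROM pages resolve to a direct array access keeps the common case to a single load, with handler calls only for hardware register pages:

```c
/* 256-entry page table over the 64K address space: RAM/ROM pages point
   straight at backing memory, hardware register pages go through handlers. */
#include <stdint.h>

typedef uint8_t (*read_fn)(uint16_t addr);
typedef void    (*write_fn)(uint16_t addr, uint8_t value);

typedef struct {
    const uint8_t *read_mem;   /* non-NULL: direct read from RAM/ROM */
    uint8_t       *write_mem;  /* non-NULL: direct write to RAM */
    read_fn        read;       /* used when read_mem is NULL (GTIA, POKEY, ANTIC, PIA pages) */
    write_fn       write;      /* used when write_mem is NULL */
} page_t;

static page_t pages[256];

static inline uint8_t mem_read(uint16_t addr) {
    const page_t *p = &pages[addr >> 8];
    return p->read_mem ? p->read_mem[addr & 0xFF] : p->read(addr);
}

static inline void mem_write(uint16_t addr, uint8_t value) {
    page_t *p = &pages[addr >> 8];
    if (p->write_mem)
        p->write_mem[addr & 0xFF] = value;
    else
        p->write(addr, value);
}
```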

 

Thanks.

An ObjC method call that is in the IMP cache (i.e. it's been called before, and recently) is about the same speed as a C++ virtual method call. It's not as fast as a non-virtual method or a C function call, but you can imagine that a *lot* of engineering time has been expended over the years on making objc_msgSend() as fast as humanly possible. Interestingly, it's actually not as fast as it used to be on x86; I think a trick or two has been sacrificed for some other feature. objc_msgSend() used to be *faster* than a virtual method call…


In general, having the CPU do less on *every* iteration is of course A Good Thing, but the M1 has an enormous L1 cache, and a branch predictor tied to that cache size, such that a predicted branch takes 1 clock or (more likely, assuming your jump distance is > 4kB) 3 clocks. Even unpredicted branches should never take more than 8 clocks, though. A huge series of ifs should actually work fairly well - not that I'd advise it :)

 

The call overhead isn't so much the problem as the compiler's loss of visibility into the called method, which in turn disables lots of optimizations. This is bad when something like a simple call in a loop prevents big-ticket optimizations like NEON vectorization. An example is the ANTIC module calling into GTIA to pass one color clock at a time -- you can get away with this if the compiler can inline directly or speculatively, but otherwise there'll be quite a bit of overhead.
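One hedged sketch of how to avoid that per-color-clock dispatch is to batch (plain C, hypothetical names): ANTIC appends its output into a scanline buffer and GTIA gets one call per scanline, so the inner loop stays visible to the optimizer.

```c
/* Batching ANTIC -> GTIA output: fill a scanline buffer with AN codes and
   hand the whole span to GTIA in one call, instead of one dynamically
   dispatched call per color clock. */
#include <stdint.h>
#include <stddef.h>

#define PLAYFIELD_CLOCKS 160   /* color clocks of normal-width playfield */

typedef struct {
    uint8_t anx[PLAYFIELD_CLOCKS];   /* AN codes produced by ANTIC */
    size_t  count;
} scanline_buf_t;

/* Cheap, inlinable append on the ANTIC side. */
static inline void antic_emit(scanline_buf_t *buf, uint8_t an_code) {
    if (buf->count < PLAYFIELD_CLOCKS)
        buf->anx[buf->count++] = an_code;
}

/* One call per scanline; the per-pixel loop inside is a candidate for
   auto-vectorization. The grayscale lookup is only a stand-in for the real
   GTIA priority/collision logic. */
static void gtia_render_span(const scanline_buf_t *buf, uint32_t *dst) {
    static const uint32_t shades[4] = {
        0xFF000000, 0xFF555555, 0xFFAAAAAA, 0xFFFFFFFF
    };
    for (size_t i = 0; i < buf->count; ++i)
        dst[i] = shades[buf->anx[i] & 3];
}
```

(Subject to the collision-register caveat below, which keeps part of GTIA inside the 6502 loop.)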

 

That having been said, trying to multithread the emulation is a far bigger culprit. Generally, running the chips at scanline granularity at minimum is needed for a high level of compatibility. That means updates at 15.7kHz, which is a pretty high rate to be shuttling data between threads. Furthermore, ANTIC is tightly coupled to the 6502, and there will be compatibility issues very quickly even if performance isn't an issue, since the feedback cycle between ANTIC and the 6502 is frequently only a handful of machine cycles.

GTIA is more decoupled but is still within the loop due to the collision registers, which make the decoding, priority, and P/M graphics observable to the 6502. Only the parts after that can be safely buffered and run async with fewer latency concerns. POKEY is similar, as the timers and serial port are observable via IRQs, but the audio logic is not. You need aggressive, high-performance thread communication for this to work, and the potential gains are questionable. I think the last time I benchmarked my single-threaded emulation on ARM, it ran at 25x real time on a rather old Snapdragon 835.

 

A cached IMP message-send is essentially the same operation as a virtual method call - it's an indirect jump through a function pointer. You pay a (small, ~1ns) cost on the first invocation and then it's cached and ready to go if that method in that class is called again. In theory there's an LRU for IMPs, but I've never seen it actually evict one - possibly if you have a huge number of informal protocols over NSObject :)

 

The compiler ought to have full visibility into a method - the signature is known at compile time, and if the wrong signature is given at runtime by using dynamic IMP-swizzling, the result will be a mismatch in method lookup, and a quick crash. That would happen before any of the jump-to-routine code would be called, so on a successful message-send, the compiler can make as many assumptions and optimizations as it wants.


I guess if your C/C++ optimizations are sufficiently aggressive that they're hoisting code blocks or data out of the method call (thus avoiding it altogether - an inline, for example) then that's different. Of course, you can then just rename your code to .mm instead of .m and get inline methods via ObjC++, with the obvious trade-off that you lose the dynamic dispatch - which you presumably don't care about, because you marked it inline :)

 

One of the nice things about ObjC is that you can drop into C/C++ for the really-must-be-performant parts of the code, and get the nice features of ObjC for the remainder. It’s a true superset of C, so it is guaranteed to work with any C code you throw at it. How it manages C++ I am less sure about …

 

Anyway, I think this is going a bit astray from the original question, so I’ll stop hijacking the thread :) 


Agreed that multithreading at that rate of context-switch is going to be an issue. Even running things on different cores, the synchronization between threads is going to pack a punch, and unless you’ve arranged to share the memory and you’re passing {position,length} tuples, copying the data could have an impact too.
