About This Club
Atari VCS/2600 development using the Harmony/Melody board.
See the GENERAL discussion area for setting up your environment.
See the CDFJ discussion area for the tutorial.
- What's new in this club
-
Piledriver joined the club
-
Siddikinz joined the club
-
You are correct on most of these points. However, in practice the speed difference between 70 MHz and 60 MHz is smaller: when running at 60 MHz I've been successfully using fewer flash wait states, so the actual difference is less than 16%.
-
RL's kept me busy so I still need to review the CDFJ+ discussions, but if I recall correctly, there are two startup routines in CDFJ+: (1) launched via the Harmony Menu, and (2) flashed onto a Melody board. The Menu startup routine does not do anything to the MHz, so it's left at the 70 MHz that the Harmony Menu boots up at. When flashed to a Melody board, the CDFJ+ driver specifically sets it to 60 MHz. Routines will take about 16% longer to execute at 60 MHz vs 70 MHz. For some games that's OK, but for others the Vertical Blank and/or Overscan routines will take too long to run, resulting in screen jitter and/or roll. I think there was a chip issue that limited it to 60 MHz, but other chips are now available that can handle 70 MHz. The existing supply of 60 MHz chips can be used for batari BASIC DPC+ games, as they don't need the extra 10 MHz. If this is correct, then the Melody startup sequence needs to set the board to 70 MHz. If it's not correct, then the Menu startup sequence should set the board to 60 MHz so we have proper results when people test play on a Harmony Cart.
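To put a number on the "about 16%" figure, here is a minimal sketch of the clock arithmetic. It assumes one cycle per clock tick and ignores flash wait states, so it is only a rough model; cycles_to_ns and slowdown_permille are hypothetical helper names, not part of any CDFJ+ driver.

```c
#include <assert.h>
#include <stdint.h>

/* Time in nanoseconds to run `cycles` CPU cycles at `clock_mhz`.
   Assumption: one cycle per clock tick, flash wait states ignored. */
static uint32_t cycles_to_ns(uint32_t cycles, uint32_t clock_mhz)
{
    assert(clock_mhz != 0u);
    return (cycles * 1000u) / clock_mhz;
}

/* Extra run time, in tenths of a percent, when dropping from
   fast_mhz to slow_mhz (the same routine, the same cycle count). */
static uint32_t slowdown_permille(uint32_t fast_mhz, uint32_t slow_mhz)
{
    assert(slow_mhz != 0u && fast_mhz >= slow_mhz);
    return (fast_mhz * 1000u) / slow_mhz - 1000u;
}
```

For a made-up 4200-cycle routine this gives 60000 ns at 70 MHz and 70000 ns at 60 MHz, and slowdown_permille(70, 60) returns 166, i.e. roughly 16.6% longer before any wait-state differences are taken into account.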
-
I'm still not exactly clear what the issue is with this - is it that we need to run certain games on the Melody board at 70MHz, but it doesn't work? Or doesn't work on some carts? Is this a chip issue? A driver issue? Is it even a solvable issue? I know this is holding up Zeviouz at the moment too, and I think it had an impact on some of the Turbo carts John shipped (but he could better speak to that). I'm not a developer (and I don't even play one on TV), but I'm actively involved in helping to get some of the affected games published, and I feel there's a mounting sense of frustration over this. Respectfully, - Nathan
-
OK, as long as no NDA based secrets are creeping into Stella, this should be fine. Well, if you make the entire ROM "visible", then there is not much secret left.
-
It's for the 2600+, not CDFJ drivers. Of course I can talk about the new CDFJ driver itself. All I am doing with that is making essentially the entire ROM "visible" to the console so that future carts can easily be read by the 2600+. As for how existing CDFJ games will work on the 2600+, that gets into the algorithms in the 2600+ firmware and I think discussing that publicly may be verboten.
-
Well, some "who" requested you to sign the "what", right? And these requests are what worries me. Why would there be an NDA for homebrew drivers? I can only think of monetary interests (correct me if I am wrong). And again it would be money that spoils the fun (at least for me).
-
Yes, I know who, and their initials are N.D.A... and I guess it's less of a "who" and more of a "what"...
-
Huh? Can you tell who is forcing you to act like this? BTW: I do not like more of this creeping into our hobby.
-
Well, without revealing confidential information that I am privy to, I can say the following: Although current CDFJ carts can be supported, there is no way to dump arbitrary CDFJ carts. Basically, every individual CDFJ ROM needs its own individual implementation in the dumper. While this can theoretically be done in many cases, it is certainly not sustainable going forward. So as John said, we are looking at changing the CDFJ driver(s) to allow them to be supportable using a general method, so all future carts will work without having to expressly add each and every one. Sorry if I am being vague, but y'know, confidential information. Anyway, I may have figured out a way to modify the CDFJ driver(s) to be supportable without using up any extra space. I'm doing some experiments with it now.
-
I'd like to finally get the 60 MHz / 70 MHz issue resolved. We're also looking into rereleasing Draconian, and I'd love to have it support the 2600+. Considering the larger ROM size, I could see increasing the size of the CDFJ driver if we need to; it doesn't need to be a full 1K, it could be 256 or 512 bytes.
-
Hey Fred - we should touch base and see if we can figure out a way to get Tutankham Arcade running on a 2600+ *before* I release it and if there is anything we can do to the driver/source code to make it easier. Let me know if there's anything I can do to help! I have a 2600+ and Tutankham Arcade ROMs; the only thing I lack is the knowledge of how to change the CDFJ+ driver.
-
The "CDFJ++" is something I came up with, but it really is meant to be CDFJ+ rev2 or something like that. I proposed that we modify the existing CDFJ+ driver to support the copy memory feature that is in DPC+ that @batari leveraged to allow the 2600+ to dump DPC+ games. I am releasing Tutankham Arcade in a few months and it uses CDFJ+ and my hope was that some sort of update to the driver could be made before it's released so it can be more easily dumped by the 2600+. In your defense TJ, it was myself and @SpiceWare that made the changes to Stella to support CDFJ+ LDX/LDY fastfetch and the configurable data stream offset so that's probably why it didn't ring any bells. 🔔
-
I'm waiting for CDFJ+ Platinum.
-
Looks like I lost track of all the versions.
-
Yes. And it's the bane of StellaDS mainly due to the optional features of LDX # and LDY # as fast fetchers (less so due to the much larger ROM/RAM space). But you must have known that as Stella implements CDFJ+
-
wavemotion joined the club
-
Why '++'? Does '+' exist already?
-
No more details yet - the conversation is ongoing.
-
batari joined the club
-
D Train joined the club
-
I've heard something about CDFJ++, which apparently can be dumped by the Atari 2600+ and can play on that device. @batari @SpiceWare @johnnywc Anyone got more details on CDFJ++?
-
SuperZapperRecharge joined the club
-
Slipqueue joined the club
-
Capellão joined the club
-
texpat joined the club
-
sfish joined the club
-
SiLic0ne t0aD joined the club
-
JeremiahK joined the club
-
undefinedopcode joined the club
-
SJD69 joined the club
-
maddigor2 joined the club
-
oceanix42 joined the club
-
Dave C joined the club
-
SledgeHammerD joined the club
-
bent_pin joined the club
-
tabytha joined the club
-
garycimera1968 joined the club
-
Hardware Division via Multiplication
Andrew Davie replied to SpiceWare's topic in Harmony/Melody's General
Just reading up on CDFJ+ to see how to get up to speed, and came to this page as part of the documentation. I just thought I'd add that for division by known constant, I pretty much let the compiler do all the work for me. I usually allocate 16 bits for fraction and do this...
    divide_by_5 = (value * (0x10000 / 5)) >> 16;
    divide_by_17 = (value * (0x10000 / 17)) >> 16;
No need for tables or looking up shifts/values.
-
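One caveat with the fixed-point reciprocal trick above: 0x10000 / 5 truncates to 13107, so exact multiples come out one too low (5 / 5 evaluates to 0). A minimal sketch of a common variant rounds the reciprocal up instead. RECIP_5 and the 16384 input limit are choices of this sketch, not something from the post; the limit is the range in which the rounded-up reciprocal matches value / 5 exactly.

```c
#include <assert.h>
#include <stdint.h>

/* 1/5 in 0.16 fixed point, rounded UP: 13108 instead of the truncated 13107. */
#define RECIP_5 ((0x10000u + 5u - 1u) / 5u)

/* Divide by 5 using one multiply and one shift.
   Assumption of this sketch: value < 16384. Beyond that the rounding
   error of the reciprocal can push the result one too high. */
static uint32_t divide_by_5(uint32_t value)
{
    assert(value < 16384u);
    return (value * RECIP_5) >> 16;
}
```

The same pattern applies to other constants, but the safe input range depends on the divisor and on how many fraction bits are used, so it is worth checking the range a game actually needs before relying on it.
-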
Shquata joined the club
-
tross2600 joined the club
-
Magovinna joined the club
-
Old-Number-7 joined the club
-
Emyxox71 joined the club
-
Speed up your memset and memcpy with loop unrolling
bithopper replied to bithopper's topic in Harmony/Melody's General
Interestingly, Dionoid's size optimized code is not only smaller, but also performs slightly better than the original. The original has some register saving overhead; for some reason it needs one register more than the other simple loop memsets. When I played around with thumb16 code, I found that incrementing the pointer instead of indexing often leads to slightly better loop results, at least with the compiler flags and options used. Here, the optimizer found and used a special ARM instruction (STMIA). This is why the presented alternate size optimized single loop code is two more bytes smaller (12 bytes) and also slightly faster.
-
Speed up your memset and memcpy with loop unrolling
bithopper replied to bithopper's topic in Harmony/Melody's General
The gist: myMemsetInt_br8 always runs faster, right from count 1, and its average speed quickly increases with the size to fill, until it reaches almost 7 times (6.85) that of the original loop and still nearly 6 times (5.7) that of the fastest single loop in this comparison. To be exact: setting one integer (count 1) takes 19 cycles (including the call overhead for non-inlined code), roughly the same as most variants. The original is slower at the beginning as it has some initialization instructions; it needs 32 cycles. At count 8 a little setback takes place, reducing the average speed; however, the power of the block loop kicks in now and makes up for it very quickly (see data series and chart). Important here: we are never slower than any of the simple loops, even in this 'fast block preparation phase'! Some numbers: to fill 7 integers takes less than half the time of the fastest single loop, 17 integers a third, 40 a quarter, 120 a fifth, and to fill/clear 1000 integers only takes a bit over a sixth (5.6) of the time, with a lean 1.8 cycles per filled integer. So much for the theory - now let's see it in practice.
-
Speed up your memset and memcpy with loop unrolling
bithopper replied to bithopper's topic in Harmony/Melody's General
Got some questions about how much is actually gained by using the presented methods. As I had no detailed answer to that, only some promising observations in my game effort, here are some calculations, based on a data sheet with cycle timings (from ARM, also attached). This might not completely replace actual measurements on real hardware but is probably not too far off to shed some light on it. The attached PDF analyzes the cycle count for four relevant variants discussed in the topic and compares them, also graphically. It was done in Excel and I hope the cycle calculations are correct - you might do a countercheck and tell me where I miscounted... (In case you are wondering: the cycle count shown is calculated for a non-inlined function, so it includes the cycle counts for the method call and its return.) In case the PDF doesn't speak for itself and needs some explanation, please let me know as well. myMemsetInt_CycleComparison_4Variants.pdf
-
Speed up your memset and memcpy with loop unrolling
bithopper replied to bithopper's topic in Harmony/Melody's General
Here is the corresponding optimized custom memsetInt in native assembler code. The same principle applies as with the assembler memcpy in my last post: the memory is set blockwise, 8 integers (i.e. 32 bit words) per loop pass, using well suited special ARM instructions (thanks, Thomas) up to a point where the remaining number of integers/words is below the block size. These are then handled without another loop. Once again the aim was a reasonably low overhead for small(ish) sizes of memory to set. Nevertheless, when filling/clearing larger buffers, the method still performs well. Its size is ~48 bytes of ARM Thumb code. Remember: the destination pointer needs to be 4-aligned.
    void myMemsetInt_br8(unsigned int* destination, unsigned int fill, int count) {
        // (C) 2023 oliver gross aka bithopper - use at your own RISC
        // V0.9.230219_0
        asm volatile(
            "cmp r2, #8\n\t"            // fewer than 8 words: go straight to the tail
            "blt .LL02\n\t"
            "push {r4-r6}\n\t"
            "mov r4, r1\n\t"            // replicate the fill value into r4-r6
            "mov r5, r1\n\t"
            "mov r6, r1\n\t"
            "sub r2, #8\n\t"
            ".LL01:\n\t"                // block loop: 8 words per pass
            "stmia r0!, {r1, r4-r6}\n\t"
            "stmia r0!, {r1, r4-r6}\n\t"
            "sub r2, #8\n\t"
            "bpl .LL01\n\t"
            "add r2, #8\n\t"            // r2 = remaining words (0..7)
            "pop {r4-r6}\n\t"
            ".LL02:\n\t"
            "lsl r2, r2, #1\n\t"        // each tail store is 2 bytes of code
            "mov r3, #13\n\t"
            "sub r3, r2\n\t"
            "add pc, r3\n\t"            // computed jump: skip (7 - remaining) stores
            "stmia r0!, {r1}\n\t"
            "stmia r0!, {r1}\n\t"
            "stmia r0!, {r1}\n\t"
            "stmia r0!, {r1}\n\t"
            "stmia r0!, {r1}\n\t"
            "stmia r0!, {r1}\n\t"
            "stmia r0!, {r1}\n\t"
            : : : "r3", "cc", "memory"
        );
    }
The code seems to run fine in my WIP, compiled with GCC, so I decided to post it here. However: this is a beta version! You are welcome to use it, but be warned: it is not thoroughly tested and there might be bugfixes and other updates in the near future. Please let me know of any issues or suggestions, feedback is most welcome 🙂
-
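For readers who would rather not drop inline assembler into their project, the block-then-tail idea of the memset above can be sketched in portable C. This is only an illustration of the principle: memset_words_br8 is a hypothetical name, the tail uses a switch with fall-through in place of the computed jump, and what the compiler makes of it (including whether it emits STMIA) depends on flags and version.

```c
#include <assert.h>
#include <stdint.h>

/* Fill `count` 32-bit words at `dst` with `fill`.
   Hypothetical portable counterpart of the block/tail scheme. */
static void memset_words_br8(uint32_t *dst, uint32_t fill, int count)
{
    assert(((uintptr_t)dst & 3u) == 0u);   /* destination must be 4-aligned */

    /* Block phase: 8 words per pass, one loop test per 8 stores. */
    while (count >= 8) {
        dst[0] = fill; dst[1] = fill; dst[2] = fill; dst[3] = fill;
        dst[4] = fill; dst[5] = fill; dst[6] = fill; dst[7] = fill;
        dst += 8;
        count -= 8;
    }

    /* Tail phase: 0..7 words left, handled without another loop by
       entering the chain of stores at the right case. */
    switch (count) {
    case 7: *dst++ = fill; /* fall through */
    case 6: *dst++ = fill; /* fall through */
    case 5: *dst++ = fill; /* fall through */
    case 4: *dst++ = fill; /* fall through */
    case 3: *dst++ = fill; /* fall through */
    case 2: *dst++ = fill; /* fall through */
    case 1: *dst++ = fill; /* fall through */
    default: break;
    }
}
```

A call such as memset_words_br8(buffer, 0, 19) runs two block passes and then enters the switch at case 3. Cycle counts for this C version are not covered by the figures quoted in the topic and would need to be measured separately.
-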
Prizrak joined the club
-
Speed up your memset and memcpy with loop unrolling
bithopper replied to bithopper's topic in Harmony/Melody's General
Spent another couple of hours on the bytewise memcpy, as the result of the C compiler optimization still left something to be desired. After several frustrating attempts I took a deep breath and decided to write a custom native Thumb16 code memcpy. Maybe it is also of use to you, fellow hero (or heroine?!) of the VCS. I reduced the unrolling to 8, now featuring two phases: a faster 'block' copy (although still using single value copy instructions, as the multiple register load/store functions appear to work only with whole words), followed by 0 to 7 final single byte copies. As the overall size is ~108 bytes, this implementation will not be suited for every VCS project (most certainly not for Dionoid's Loderunner). It is still optimized for small copy sizes, though. I aimed for a low overhead when copying just a few bytes. When copying more than a few dozen bytes, we could use another memcpy implementation that aligns source and destination pointers and then ends up in a super fast integer copy. These algorithms are way faster for big copy sizes. However, they tend to be even bulkier, with a sobering overhead ratio for the first few bytes (how big are your sprites, again?)
    void myMemcpy_br8(unsigned char* destination, unsigned char* source, int count) {
        // (C) 2023 oliver gross aka bithopper - use at your own RISC
        // V0.9.230215_1
        asm volatile(
            "cmp r2, #7\n\t"            // 7 bytes or fewer: go straight to the tail
            "ble .LL2\n\t"
            "sub r2, #8\n\t"
            ".LL3:\n\t"                 // block loop: 8 bytes per pass
            "ldrb r3, [r1]\n\t"
            "strb r3, [r0]\n\t"
            "ldrb r3, [r1, #1]\n\t"
            "strb r3, [r0, #1]\n\t"
            "ldrb r3, [r1, #2]\n\t"
            "strb r3, [r0, #2]\n\t"
            "ldrb r3, [r1, #3]\n\t"
            "strb r3, [r0, #3]\n\t"
            "ldrb r3, [r1, #4]\n\t"
            "strb r3, [r0, #4]\n\t"
            "ldrb r3, [r1, #5]\n\t"
            "strb r3, [r0, #5]\n\t"
            "ldrb r3, [r1, #6]\n\t"
            "strb r3, [r0, #6]\n\t"
            "ldrb r3, [r1, #7]\n\t"
            "strb r3, [r0, #7]\n\t"
            "add r0, #8\n\t"
            "add r1, #8\n\t"
            "sub r2, #8\n\t"
            "bpl .LL3\n\t"
            "add r2, #8\n\t"            // r2 = remaining bytes (0..7)
            ".LL2:\n\t"
            "lsl r2, r2, #3\n\t"        // each tail copy is 8 bytes of code
            "mov r3, #55\n\t"
            "sub r3, r2\n\t"
            "add pc, r3\n\t"            // computed jump: skip (7 - remaining) copies
            "ldrb r3, [r1]\n\t"
            "add r1, #1\n\t"
            "strb r3, [r0]\n\t"
            "add r0, #1\n\t"
            "ldrb r3, [r1]\n\t"
            "add r1, #1\n\t"
            "strb r3, [r0]\n\t"
            "add r0, #1\n\t"
            "ldrb r3, [r1]\n\t"
            "add r1, #1\n\t"
            "strb r3, [r0]\n\t"
            "add r0, #1\n\t"
            "ldrb r3, [r1]\n\t"
            "add r1, #1\n\t"
            "strb r3, [r0]\n\t"
            "add r0, #1\n\t"
            "ldrb r3, [r1]\n\t"
            "add r1, #1\n\t"
            "strb r3, [r0]\n\t"
            "add r0, #1\n\t"
            "ldrb r3, [r1]\n\t"
            "add r1, #1\n\t"
            "strb r3, [r0]\n\t"
            "add r0, #1\n\t"
            "ldrb r3, [r1]\n\t"
            "add r1, #1\n\t"
            "strb r3, [r0]\n\t"
            "add r0, #1\n\t"
            : : : "r3", "cc", "memory"
        );
    }
A word of caution - although it performed nicely in my WIP, using GCC, it is not thoroughly tested, so please regard this as highly experimental code that might still see some updates in the near future. (Did I mention it is now highly platform dependent?) If someone uses this method and finds a bug, or has some other suggestions about the ARM code, please let me know!
-
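The two-phase structure of the bytewise memcpy above (an 8-byte block loop, then 0 to 7 single copies) can also be sketched in portable C. This is a rough illustration only: memcpy_bytes_br8 is a hypothetical name, the switch stands in for the computed jump, and the code size and speed of the compiled result are not covered by the ~108 byte figure quoted for the hand-written version.

```c
#include <assert.h>

/* Copy `count` bytes from `src` to `dst`; the buffers must not overlap.
   Hypothetical portable counterpart of the block/tail scheme. */
static void memcpy_bytes_br8(unsigned char *dst, const unsigned char *src, int count)
{
    assert(dst != 0 && src != 0);

    /* Block phase: 8 single-byte copies per pass, one loop test per 8 bytes. */
    while (count >= 8) {
        dst[0] = src[0]; dst[1] = src[1]; dst[2] = src[2]; dst[3] = src[3];
        dst[4] = src[4]; dst[5] = src[5]; dst[6] = src[6]; dst[7] = src[7];
        dst += 8;
        src += 8;
        count -= 8;
    }

    /* Tail phase: 0..7 bytes left, copied without another loop. */
    switch (count) {
    case 7: *dst++ = *src++; /* fall through */
    case 6: *dst++ = *src++; /* fall through */
    case 5: *dst++ = *src++; /* fall through */
    case 4: *dst++ = *src++; /* fall through */
    case 3: *dst++ = *src++; /* fall through */
    case 2: *dst++ = *src++; /* fall through */
    case 1: *dst++ = *src++; /* fall through */
    default: break;
    }
}
```

Because it copies byte by byte, this sketch has no alignment requirement on either pointer, which matches the small-sprite use case the post is aiming at.
-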
Forgot to update the Tips and Tricks topic with this, it's now in there.
-