Math speedup ROMs

Maury Markowitz · September 27, 2017

A number of ROMs include code to speed up math routines - Omni had one IIRC and I recall another as well.

Does anyone know exactly what these routines did in order to run faster? I recall Bill Wilkinson telling me the original routines were written by "the new guy" and could have been much faster, but I am not clear if this is what they did.

Likewise, these same ROMs often included graphics speedups... similar question, does anyone know what exactly these did?

pirx · September 27, 2017

optimised code, that is all

Rybags · September 28, 2017

The story I read (here) was the original FP routines were based on generic examples of the time.

They chose to use BCD where other systems like C= use a binary representation which allows faster calculations.

Quicker line draws - I've not heard of an alternate OS which does this. The default routines are poor performing due to being flexible to work in any graphics mode as well as being written to take a low amount of space in Rom and to use very little Ram. But even without taking a huge amount more the OS routines could be sped up by a worthwhile amount by better programming.

phaeron · September 28, 2017

The most common technique used by fast math packages is to unroll the add/subtract loops. This can take a lot of space, so often the international character set area at $CC00-CFFF is stolen for the additional code. There are also fast math routines that stay within the $D800-DFFF region, just by using tightly optimized code and improved multiply/divide algorithms.

My OS has faster line draw and fill -- 3.6x and 16.2x over the XL/XE OS -- but it was more of a side effect of optimizing for size. The main reason the OS line/fill routine is slow is because it calls a PutPixel(x,y) routine which has to compute the row address over and over. Switching to direct plot with rotating bit masks is not only faster, it frees up space to move more temp variables to page zero, offsetting code size. The thing is, speeding up line/fill is not that interesting because you still have to go through CIO to get to it and too much of a speedup can virtually break programs that depend on those routines being slow.
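To make the PutPixel versus rotating-mask difference concrete, here is a rough sketch in Python rather than 6502 assembly. It is not the code from any actual OS ROM. The 40-bytes-per-row, 1-bit-per-pixel layout and all function names are assumptions for illustration, and only x-major lines drawn left to right are handled.

```python
BYTES_PER_ROW = 40   # assumed: 320 pixels wide, 1 bit per pixel
ROWS = 192           # assumed screen height

def put_pixel(screen, x, y):
    # Recomputes the byte address and bit mask from scratch on every call.
    screen[y * BYTES_PER_ROW + (x >> 3)] |= 0x80 >> (x & 7)

def line_via_put_pixel(screen, x0, y0, x1, y1):
    """Bresenham line built on put_pixel; needs x0 <= x1 and |dy| <= dx."""
    dx, dy = x1 - x0, abs(y1 - y0)
    ystep = 1 if y1 >= y0 else -1
    err, y = dx // 2, y0
    for x in range(x0, x1 + 1):
        put_pixel(screen, x, y)
        err -= dy
        if err < 0:
            y += ystep
            err += dx

def line_incremental(screen, x0, y0, x1, y1):
    """Same line, but the address and mask are computed once and then
    updated step by step (rotate the mask, bump the address)."""
    dx, dy = x1 - x0, abs(y1 - y0)
    row_step = BYTES_PER_ROW if y1 >= y0 else -BYTES_PER_ROW
    err = dx // 2
    addr = y0 * BYTES_PER_ROW + (x0 >> 3)
    mask = 0x80 >> (x0 & 7)
    for _ in range(dx + 1):
        screen[addr] |= mask
        mask >>= 1              # move one pixel right within the byte
        if mask == 0:           # fell off the byte: start the next one
            mask = 0x80
            addr += 1
        err -= dy
        if err < 0:             # step one row up or down
            addr += row_step
            err += dx

a = bytearray(BYTES_PER_ROW * ROWS)
b = bytearray(BYTES_PER_ROW * ROWS)
line_via_put_pixel(a, 3, 5, 200, 60)
line_incremental(b, 3, 5, 200, 60)
```

Both routines set the same pixels; the second one just never multiplies or re-derives the mask inside the loop, which is where the time goes on a 6502.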

Mclaneinc · September 28, 2017

I just love the fact that people today are taking the old Atari, looking at the OS, the BASIC and the hardware, and fixing and upgrading the code. It would be so easy just to say it is what it is, but no, people are dedicating their time to make things better. You just have to both applaud and love that commitment.

As always I salute you guys (and gals?) who do stuff for the Atari (and emulation, hardware and software in general). It really is appreciated, and although I'm not a coder (well, a really rubbish one) I do appreciate the time these things take. It's rarely just a minor change in the source; it's UI rebuilding, checking it does not break things, actual coding time and loss of free time, amongst many other things. Thank you all, and I know that isn't just from me, it's from us all.

Edited September 28, 2017 by Mclaneinc

thorfdbg · September 28, 2017

Does anyone know exactly what these routines did in order to run faster? I recall Bill Wilkinson telling me the original routines were written by "the new guy" and could have been much faster, but I am not clear if this is what they did.

Likewise, these same ROMs often included graphics speedups...

The math functions in the original math package are naive in multiple ways. Take the BCD to integer conversion, which naively works by doubling and halving the BCD/integer parts, with multiple operations required per bit. Instead, one can convert entire digit pairs at once, and one can also stop early if an overflow occurs.
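A minimal sketch of the digit-pair idea, in Python for readability. It assumes the input is a list of packed BCD bytes, most significant first, and that the target is a 16-bit unsigned integer; the function names are made up for illustration and this is not the code from any particular fast math ROM.

```python
def bcd_byte_to_int(b):
    """Decode one packed BCD byte (two decimal digits) to 0..99."""
    hi, lo = b >> 4, b & 0x0F
    if hi > 9 or lo > 9:
        raise ValueError("not a packed BCD byte")
    return hi * 10 + lo

def bcd_pairs_to_u16(pairs):
    """Convert packed BCD digit pairs to a 16-bit unsigned integer,
    consuming a whole digit pair per step instead of one bit per step.
    Returns None as soon as the value no longer fits in 16 bits."""
    value = 0
    for b in pairs:
        value = value * 100 + bcd_byte_to_int(b)
        if value > 0xFFFF:
            return None      # stop early on overflow
    return value
```

On a real 6502 the per-byte decode would most likely be a 256-byte (or smaller) lookup table, and the multiply by 100 a handful of shifts and adds.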

Multiply is equally trivial. To multiply a number by 9, the current implementation adds the number nine times in BCD. Not only can one speed this up by unrolling the loop, one can also precompute the number *2, *4 and *8 and then perform the minimum number of adds required (or break it down in any other way). Unfortunately, this requires more ROM space, and more RAM for temporaries, which is why TurboBasic occupies more RAM at the low end of memory.
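As a rough illustration of the *2/*4/*8 trick, here is a sketch in Python. Plain Python integers stand in for the BCD mantissas, and the function names and the add counter are made up for the example; a real routine would do the adds in decimal mode on multi-byte values.

```python
def times_digit(x1, x2, x4, x8, digit):
    """Multiply by one decimal digit (0..9) using the precomputed
    multiples. Returns (product, number of adds performed)."""
    total, adds = 0, 0
    for bit, multiple in ((8, x8), (4, x4), (2, x2), (1, x1)):
        if digit & bit:
            total += multiple
            adds += 1
    return total, adds

def multiply(a, b):
    """Multiply a by b one decimal digit of b at a time.
    Returns (product, total adds of precomputed multiples)."""
    x1 = a
    x2 = x1 + x1        # the three temporaries that cost extra RAM
    x4 = x2 + x2
    x8 = x4 + x4
    result, total_adds = 0, 0
    for ch in str(b):
        partial, adds = times_digit(x1, x2, x4, x8, int(ch))
        result = result * 10 + partial   # decimal shift, then accumulate
        total_adds += adds
    return result, total_adds

def naive_adds(b):
    """Adds needed when each digit d is handled by adding d times."""
    return sum(int(ch) for ch in str(b))
```

For example, `multiply(1234, 999)` needs 6 adds of precomputed multiples (two per digit, 8 + 1), where the repeated-add approach needs `naive_adds(999)`, i.e. 27.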

Graphics speedup: The line draw/fill function is built on top of the plot function, so it has to recompute the pixel shift and pixel mask for every pixel. This is not necessary: instead of updating positions, one can update the masks and shifts as the line moves along. It's a bit unclear whether this requires more or less ROM than the original, but a couple of third-party RAM-based line drawing functions make this *a lot* faster (probably by a factor of 10 or so), though they are then less generic than what the OS does.

JamesD · September 28, 2017

The story I read (here) was the original FP routines were based on generic examples of the time.

They chose to use BCD where other systems like C= use a binary representation which allows faster calculations.

Quicker line draws - I've not heard of an alternate OS which does this. The default routines are poor performing due to being flexible to work in any graphics mode as well as being written to take a low amount of space in Rom and to use very little Ram. But even without taking a huge amount more the OS routines could be sped up by a worthwhile amount by better programming.

Commodore BASIC is Microsoft BASIC.

The math routines in Microsoft BASIC are largely examples you'd find in a class telling you how to add, subtract, etc...

They aren't unoptimized, but space is a greater consideration than speed... you certainly won't find any unrolled loops.

Since the code uses single precision(?), you have a sign bit, an 8 bit mantissa, and 3(?) bytes to hold the exponent.

I'm not sure how that compares to Atari's BCD code, but normalizing floating point numbers occupies a lot of CPU time.

*edit*

I haven't looked at the 6502 math lib to be sure. I just know it's lower precision than the 68xx ones.

Edited September 28, 2017 by JamesD

Rybags · September 28, 2017

Pretty sure C= exponent would be just the 1 byte like Atari.

Overall, doing the mantissa in binary rather than BCD saves lots of space for the precision it gives.

e.g. 4 bytes of binary covers about 4 billion values, vs. 4 bytes of packed BCD, which covers only 100 million (8 digits).

Atari uses 6 bytes for FP numbers - so 10 decimal places with the remaining byte for the exponent.
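A quick way to check the capacity numbers, sketched in Python. It assumes packed BCD (two decimal digits per byte); the function names are just for illustration.

```python
def binary_values(nbytes):
    """Distinct values representable in nbytes of pure binary."""
    return 256 ** nbytes

def packed_bcd_values(nbytes):
    """Distinct values in nbytes of packed BCD (two digits per byte)."""
    return 100 ** nbytes

def pack_bcd(n, nbytes):
    """Pack a non-negative integer into nbytes of packed BCD,
    most significant digit pair first."""
    if not 0 <= n < 100 ** nbytes:
        raise ValueError("value does not fit")
    out = bytearray(nbytes)
    for i in range(nbytes - 1, -1, -1):
        n, pair = divmod(n, 100)
        out[i] = ((pair // 10) << 4) | (pair % 10)
    return bytes(out)
```

So `binary_values(4)` is 4,294,967,296 while `packed_bcd_values(4)` is 100,000,000, and a 5-byte BCD mantissa holds exactly 10 decimal digits, e.g. `pack_bcd(1234567890, 5)` gives the bytes 12 34 56 78 90 in hex.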

JamesD · September 28, 2017

I reversed the exponent and mantissa

Maury Markowitz · September 28, 2017

The most common technique used by fast math packages is to unroll the add/subtract loops. This can take a lot of space, so often the international character set area at $CC00-CFFF is stolen for the additional code. There are also fast math routines that stay within the $D800-DFFF region, just by using tightly optimized code and improved multiply/divide algorithms.

My OS has faster line draw and fill -- 3.6x and 16.2x over the XL/XE OS -- but it was more of a side effect of optimizing for size. The main reason the OS line/fill routine is slow is because it calls a PutPixel(x,y) routine which has to compute the row address over and over. Switching to direct plot with rotating bit masks is not only faster, it frees up space to move more temp variables to page zero, offsetting code size. The thing is, speeding up line/fill is not that interesting because you still have to go through CIO to get to it and too much of a speedup can virtually break programs that depend on those routines being slow.

Great info phaeron! I do recall people using the timing of the various draw routines as a way to time things, which was kind of dumb when you consider you had a fairly good clock in the video system that you could use.

And you too thorfdbg, that is precisely the sort of explanation I was looking for.

Rybags, that thing you read is perhaps something I wrote. About a decade ago I had a nice conversation with Bill and wrote up a little web page - text only - about what I learned. The info from that has been diffusing out ever since. Basically, the book they were using didn't have multiply and divide, so instead of writing routines for this, the guy they had doing it just looped adds. Bill seemed a bit embarrassed about that one (as much as one can glean from an email anyway).

JamesD · September 28, 2017

...

Rybags, that thing you read is perhaps something I wrote. About a decade ago I had a nice conversation with Bill and wrote up a little web page - text only - about what I learned. The info from that has been diffusing out ever since. Basically, the book they were using didn't have multiply and divide, so instead of writing routines for this, the guy they had doing it just looped adds. Bill seemed a bit embarrassed about that one (as much as one can glean from an email anyway).

The textbook routines to multiply and divide use looped adds and subtracts.

They normally don't call add and subtract functions but that's what they do.

For the floating point multiply Microsoft uses, you also bit shift right once for each pass through the loop which is 8 times for each byte of the multiply.

So you end up looping 8 times per byte to be multiplied, adding each byte every pass, and shifting each byte of the result for every pass.

When that's done you have to normalize the floating point numbers which involves bit shifting left until your mantissa has no zeros on the left.

Or something like that. It's slow.

Division is similar.
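For anyone who wants to see the textbook shift-and-add loop and the normalize step spelled out, here is a sketch in Python. It is the generic algorithm, not a transcription of Microsoft's 6502 code; the bit widths and function names are assumptions for illustration.

```python
def shift_add_multiply(a, b, bits=8):
    """Multiply two unsigned integers of the given width, one multiplier
    bit per pass: conditionally add the multiplicand into the high half
    of the result, then shift the whole result right by one."""
    product = 0
    for _ in range(bits):
        if b & 1:
            product += a << bits   # add into the high half
        product >>= 1              # shift the running result right
        b >>= 1                    # move on to the next multiplier bit
    return product

def normalize(mantissa, exponent, bits=24):
    """Shift the mantissa left until its top bit is set, decrementing
    the exponent once per shift. Assumes mantissa < 2**bits."""
    if mantissa == 0:
        return 0, 0
    top = 1 << (bits - 1)
    while not (mantissa & top):
        mantissa <<= 1
        exponent -= 1
    return mantissa, exponent
```

With a 24-bit mantissa that is 24 passes, each touching every byte of the operands on an 8-bit CPU, plus up to 23 more multi-byte shifts to normalize, which is why it adds up.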
