
# my first assembly program


Since the numbers are decimal, you're multiplying by 10 to build the number up. I ended up using the old 8x+2x trick; too lazy to use DIV.

So, to multiply r0 by 10:

```
MOV R0,R1   ; COPY R0 TO R1
SLA R0,3    ; MULTIPLY R0 BY 8
A R1,R1     ; MULTIPLY R1 BY 2
A R1,R0     ; R0 = R0*8 + R0*2 = R0*10
```

Not sure if that would be faster than DIV - probably not, since the code is in 8-bit memory. However, it wasn't time-critical code, and it makes register planning a lot easier.
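In higher-level terms the 8x+2x trick is just (x << 3) + (x << 1). Here's a quick sanity check of the identity (a hypothetical Python sketch, with a mask to mimic 16-bit register wraparound):

```python
def mul10(x):
    """Multiply by 10 using the 8x + 2x shift trick, in 16-bit math."""
    mask = 0xFFFF               # TMS9900 registers are 16 bits wide
    eight_x = (x << 3) & mask   # SLA R0,3 -> x * 8
    two_x = (x + x) & mask      # A R1,R1  -> x * 2
    return (eight_x + two_x) & mask

# Agrees with a plain multiply for every 16-bit input (modulo 2^16)
for x in range(0x10000):
    assert mul10(x) == (x * 10) & 0xFFFF
```

The masking matters: the trick still gives the right answer modulo 2^16 even when the intermediate shifts overflow, just as the register version would.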

I'm too lazy too. Would love to see you try multiplying using DIV though.


For example if I want to divide by 48, I'll shift 5 and then divide by 3.

Is that a typo... I don't see how that would work?

Yeah, my mistake - it should be shift 4, then divide by 3, since 48 = 16 x 3.
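For non-negative integers, shifting and then dividing by the remaining factor matches dividing by the product, which a quick check confirms (hypothetical Python sketch):

```python
def div48(x):
    """Divide by 48 as a shift right by 4 (i.e. /16) then a divide by 3."""
    return (x >> 4) // 3        # 48 = 16 * 3

# Nested floor division matches dividing by the product
for x in range(100000):
    assert div48(x) == x // 48
```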


So, to multiply r0 by 10:

```
MOV R0,R1   ; COPY R0 TO R1
SLA R0,3    ; MULTIPLY R0 BY 8
A R1,R1     ; MULTIPLY R1 BY 2
A R1,R0     ; R0 = R0*8 + R0*2 = R0*10
```

Not sure if that would be faster than DIV - probably not, since the code is in 8-bit memory. However, it wasn't time-critical code, and it makes register planning a lot easier.

Hehe.. maybe not DIV, but anyway, yeah, let's break it down then. Assuming the registers are in scratchpad (16-bit memory) and the code, as you said, is in 8-bit memory, here's how to work it out.

I don't have the 9900 manual handy so I'll use Thierry's page as a reference. I'm sure they are the same.

To MPY by 10, we need two preconditions. First, that there is a 32-bit destination available, and second, that the number 10 is loaded somewhere to use.

The first has no performance cost, so we'll just assume it is true, and that we can use R1 and R2 for the result.

For the second, there are several ways to get the 10.

The most obvious is LI (say into R3): LI R3,>000A. This takes 12 cycles and 3 memory accesses. Note that LI is one of the rare instructions that does NOT read-before-write. Two accesses are reads, for the instruction and the immediate data, and one is the write. The two reads are in 8-bit memory, so they cost 4 cycles each, and the write is to 16-bit memory, so it costs 0. The total cycle cost of this instruction is 12+4+4+0 = 20 cycles.

Another method is to load from a memory location. If we assume that location is in 8-bit RAM, then MOV @VAL10,R3 takes 14 cycles plus 4 memory accesses - one reads the instruction, one reads the source, one is the read-before-write on the destination, and the fourth is the write itself. The @ addressing mode adds an additional 8 cycles and 1 more read access. Everything is 8-bit except the register, so with a 4-cycle penalty per access, this instruction takes 14+4+4+0+0+8+4 = 34 cycles.

Once you have that, you can perform the MPY R3,R0. MPY is a fixed cycle count instruction, that is, it does not loop adds internally but uses a predictable algorithm to gain the result. Thus it always takes 52 cycles, and 5 memory accesses. One reads the instruction, one reads the source, one reads the dest, and there are two writes to round it off. (Does this mean that MPY also does not read-before-write? I have not confirmed this but it suggests so.) The only memory access penalty in this case is the instruction read, so the cost is 52+4+0+0+0+0 = 56 cycles. Unless you can keep the '10' in R3, you would have to add one of the above load costs too, so it would total 76 or 90 cycles that way.

I think MPY can also take a general address as the source, meaning that we could do MPY @VAL10,R0. That would just add the 8-cycle and 1-read penalty to the MPY itself: 52+4+0+0+0+0+8+4 = 68 cycles, and no pre-initialization required. This thus comes closest to the above routine.
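Adding those options up in code form (a hypothetical sketch; `base` is the instruction's internal cycle count, and each access to 8-bit memory adds a 4-cycle penalty, per the breakdowns above):

```python
PENALTY = 4  # extra cycles per access to 8-bit memory

def cost(base, slow_accesses):
    """Total cycles: internal count plus 4 per 8-bit memory access."""
    return base + PENALTY * slow_accesses

li_r3   = cost(12, 2)       # LI to load the 10: 2 reads from 8-bit ROM -> 20
mov_val = cost(14 + 8, 3)   # MOV @VAL10,R3: +8 for @, 3 8-bit reads    -> 34
mpy_reg = cost(52, 1)       # MPY R3,R0: only the instruction fetch     -> 56
mpy_sym = cost(52 + 8, 2)   # MPY @VAL10,R0: +8 for @, 2 8-bit reads    -> 68

print(li_r3, mov_val, mpy_reg, mpy_sym)      # 20 34 56 68
print(mpy_reg + li_r3, mpy_reg + mov_val)    # 76 90
```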

Now let's look at your routine. I'll abbreviate the explanation a bit and just give the cycle counts:

MOV R0,R1 - 14+4+0+0+0 = 18 cycles

SLA R0,3 - 12+(2*3)+4+0+0 = 22 cycles

A R1,R1 - 14+4+0+0+0 = 18 cycles

A R1,R0 - 14+4+0+0+0 = 18 cycles

Total: 18+22+18+18 = 76 cycles.

Interestingly, Thierry provides a similar example, but breaks it up differently:

```
SLL  R8,1   ; Multiply by two           -- 12+(2*1)+4+0+0 = 18
MOV  R8,R9  ; Save the doubled value    -- 14+4+0+0+0 = 18
SLL  R8,2   ; And then by 4 (making 8x) -- 12+(2*2)+4+0+0 = 20
A    R9,R8  ; Add it up: 2+8=10         -- 14+4+0+0+0 = 18
```

This one takes 18+18+20+18 = 74 cycles.
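Both tallies can be reproduced with a quick sketch (hypothetical Python; each cost is the base cycle count plus a 4-cycle penalty for the instruction fetch from 8-bit memory, registers in scratchpad):

```python
FETCH = 4                     # penalty for fetching code from 8-bit memory

mov = 14 + FETCH              # MOV Rx,Ry -> 18
add = 14 + FETCH              # A Rx,Ry   -> 18
def shift(n):                 # SLA/SLL Rx,n: 12 base + 2 per bit shifted
    return 12 + 2 * n + FETCH

routine_a = mov + shift(3) + add + add        # MOV / SLA 3 / A / A
routine_b = shift(1) + mov + shift(2) + add   # SLL 1 / MOV / SLL 2 / A

print(routine_a, routine_b)   # 76 74
```

Splitting the shift into two smaller shifts trades 2 shift cycles for nothing extra, which is where Thierry's 2-cycle win comes from.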

So what do we end up with, in the MPY x 10 case?

MPY takes from 56 cycles if you have everything set up, to 68 cycles if you don't. (You could make it 64 cycles using *Rx notation, but if you have a register to spare, you might as well load the 10 into it directly rather than keep an indirect reference to one.)

And Thierry's version of the shift version takes 74 cycles. It is slower in this scenario by 6 cycles (less than one instruction time).

If the code and data is in 16-bit RAM, things change a little bit. Now all the memory access penalties are gone. In that case, the MPY version is now 60 cycles, but the shift version is only 58 cycles - it's now faster by 2 cycles (less than one instruction time).
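That flip is easy to see if the fetch penalty is made a parameter (hypothetical sketch; `p = 4` models code and data in 8-bit memory with registers in scratchpad, `p = 0` models everything in 16-bit RAM):

```python
def totals(p):
    """Cycle totals for MPY vs. shifts with a memory-access
    penalty of p cycles (p=4: 8-bit memory; p=0: 16-bit RAM)."""
    mpy_sym = (52 + 8) + 2 * p                    # MPY @VAL10,R0
    shifts = ((12 + 2 + p) + (14 + p) +           # SLL R8,1 / MOV R8,R9
              (12 + 4 + p) + (14 + p))            # SLL R8,2 / A R9,R8
    return mpy_sym, shifts

print(totals(4))  # (68, 74): 8-bit code, MPY wins by 6
print(totals(0))  # (60, 58): all 16-bit, shifting wins by 2
```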

What if code and data are all in 8-bit RAM (registers included!)? This is a bit painful, but it may occasionally be necessary. In this case you want to stay away from RAM as much as possible, and so complex instructions like MPY are markedly better. The non-prepared MPY is now 84 cycles, but the shift method goes up to 114 cycles!

What else is different? The non-setup MPY needs 6 bytes of memory, the shift version needs 8 bytes. The MPY requires two registers for the output, but the shift version requires a spare register too, so that's more or less equivalent.

This is of course a complex case. In the simple case, such as multiplying by 8, you are comparing the MPY (at 68 cycles) to one shift of 3 at 22 cycles (and no extra data or registers needed) - no contest, shifting is far superior. The complex case of two shifts and an add is so close in all cases except all 8-bit memory as to be a wash - you'll not see the difference. If you need more than two shifts, then MPY is the clear winner.

DIV is a different beast. DIV takes a minimum of 92 cycles, up to 124 cycles, and 6 memory accesses - just for the instruction. The varying time is because this does use an iterative approach. Four memory accesses are reads, and two are writes. (Again, no r-b-w?) If we use the 10 value example above, we can extend the fastest possible unprepared DIV (excluding overflow) to 92+4+0+0+4+0+0+8+4 = 112 cycles (DIV @VAL10,R0) - this assumes that R0 and R1 are already prepared, which is not necessary with shifting. The shift/add approach (using subtraction instead) would maintain the same cycle count, and now we see that at 74 cycles it is far faster to use two shifts and subtract than to use DIV. At three shifts you probably start to break even, except in code size. Getting the remainder of a non-power of two (even a "two-shift" number) is a little more difficult, at that point I'd probably use DIV too.

One place that breaking things up into shifts is very difficult, though, is when you are not using fixed values. If you don't know what numbers are coming in, and you can't guarantee that they fit a range, you really have no choice. MPY and DIV are going to be faster than looping yourself just due to memory overhead.

But the real question before you actually start tallying up all these numbers, if you are optimizing for performance anyway, is "why are we multiplying by 10 in the first place?". We know that it's going to be an expensive operation no matter how we do it, so we need to say, "well, can we do better?" Now, reading a string of numbers into a register is not often something that you would need to optimize for speed, but that's what we're working with here, so what can we do about it?

First, we know our input. It is a string of no more than 5 ASCII digits from 0-9, representing a number from 0-65535. Every fact that you can firm up is an optimization tool. Decimal numbers don't read very well into computer algorithms, so the first question to ask would be "does it have to be decimal?" Changing the data can make a huge difference - if we instead mandated hexadecimal, then we would have four digits from 0-F to look at. There would be no chance of overflow, because four digits are guaranteed to fit in 16 bits (while 5 decimal digits may not - consider 99999!). And each digit can be converted with one or two subtractions (depending on 0-9 or A-F) and a 4-bit shift, which is much faster than any multiply by 10. This is an example of changing the data to better suit the program - when speed (or size!) is important, it is always worth looking at!
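As a sketch of that hex-parsing idea (hypothetical Python; each ASCII digit is converted with a subtraction or two and then shifted into place, no multiply needed):

```python
def parse_hex(s):
    """Parse up to 4 uppercase hex digits using only subtraction
    and shifting per digit - no multiply required."""
    value = 0
    for ch in s:
        d = ord(ch) - ord('0')            # '0'..'9' -> 0..9
        if d > 9:                         # 'A'..'F' need one more step
            d = ord(ch) - ord('A') + 10
        value = (value << 4) | d          # shift in the next nibble
    return value

print(parse_hex("FFFF"))   # 65535 - four digits always fit in 16 bits
```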

If you couldn't change the input format, but you still needed more speed, you could try reducing the amount of work done. For example, you could use a lookup table 100 bytes long and parse two digits at a time - then you would only need to multiply twice (x100 for the first pair, then x10 for the final digit), instead of four times. Even on the TI, 100 bytes is not too much to ask. (To do 3 digits at a time this way, you would need 2000 bytes, since the results would exceed one byte - but then you need only one multiply! Just look up the first 2 digits, multiply by 1000, look up the last 3 digits, and add the result.) If you DO have to use the expensive functions, then make sure you are using them with care.
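A sketch of the two-digits-at-a-time idea (hypothetical Python; the dict stands in for the 100-entry lookup table):

```python
# Stand-in for the 100-entry lookup table ("00".."99" -> 0..99)
TWO_DIGITS = {f"{i:02d}": i for i in range(100)}

def parse_decimal_5(s):
    """Parse a 5-digit decimal string two digits at a time:
    pair * 1000, pair * 10, plus the final digit."""
    value = TWO_DIGITS[s[0:2]] * 1000     # first pair
    value += TWO_DIGITS[s[2:4]] * 10      # second pair
    value += ord(s[4]) - ord('0')         # last digit
    return value

print(parse_decimal_5("12345"))   # 12345, with only two multiplies
```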

At any rate, it's worth keeping all options in mind, because as I've tried to show above, the "best" approach really depends on your needs. Do you need simple code? Do you need maximum variable density? Do you need fewest cycles? Do you need smallest code? Can you use scratchpad? Is it worth the overhead of COPYING to scratchpad, or will that cost more time than you save? There are so many issues to consider in optimization that you have to look at the problem every time.

(hopefully all that math works out!)

Edited by Tursi

Excellent review, Tursi! And thanks for taking the time to go into the gory details.
