Jump to content
  • entries
    17
  • comments
    28
  • views
    32,266

Unsigned Integer Division Routines


Omegamatrix

4,057 views

In general, multiplication in assembly is easy, and division is a bitch. There are three basic approaches to doing division. The first is to just do a loop in which the divisor is continually subtracted:

    lda    dividendValue
    ldx    #0    sec
.loopDivideBySeven
    inx
    sbc    #7
    bcs    .loopDivideBySeven
 

The advantage to this approach is that it takes very few bytes. On the other hand when the dividend is large lots of loops get taken, and each loop piles on the cycles. The completion time is variable, which for the 2600 usually means you have to figure out a worse case execution time to prevent unwanted scanline bounces. :(

 

A second approach is to use look up tables:

    ldy    dividendValue
    lda    DivideByFiveTable,Y
;... code continues

DivideByFiveTable:
    .byte 0,0,0,0,0  ; 0-4
    .byte 1,1,1,1,1  ; 5-9
    .byte 2,2,2,2,2  ; 10-14
    .byte 3,3,3,3,3  ; 15-19
    .byte 4,4,4,4,4  ; 20-24
    .byte 5,5,5,5,5  ; 25-29
    .byte 6,6,6,6,6  ; 30-34
 

This approach is the fastest way possible, aside from a trivial divide by "powers of two" case. The disadvantage... you use up gobs and gobs of rom if you want to cover a lot of values. Sometimes if feels like you have no choice when the cycles are tight, but there are usually other places to shave cycles.

 

 

These first two basic approaches seem at extremes to each other. In between these are the "hybrid" approaches, where you divide a little, and then follow up with a small look-up table, as demonstrated by Supercat.

 

 

The third approach is to use reciprocal multiplication. I became interested in doing these so called "fast" division routines a few years ago. The first thing the reader has to know is that these are not necessarily the "fastest" (tables are always faster), but are generally a good compromise in terms of bytes and cycles to the first two basic approaches. Here are some of the routines I have found. They cover divisors 2 to 32, are all constant cycle routines, and don't use X or Y registers. They are good for any value, 0 to 255. :)

 

 

New! The file below is now updated with some better routines.

 

DivisionRoutines(rev2).asm

  • Like 4

12 Comments


Recommended Comments

You can also do division as a "shift and subtract", which will reduce the number of loops to the number of shifts. However, the 6502 doesn't have the instruction set which will make this method efficient.

Link to comment

Hmm, I'll have to look into that reciprocal multiplication tonight. I'm got a 185 byte table in Frantic that's used for divide by 5 for dealing with the PF pixels that make up walls (each brick of the wall is 5 scan lines tall). The table's used in the calculations that check if a sprite or missile has collided with a wall. Due to flicker, all collisions are done in software. Also the Frenzy variation needs to know which segment was hit so it can remove the brick if it was part of a destructible wall.

Link to comment

The division routines are for 6502 assembly and highly optimized to take as little time and bytes as possible. I don't know if you could port it to C code or ARM assembly so easily. The divide by five looks like this:

;Divide by 5
;18 bytes, 30 cycles
  sta  temp
  lsr
  adc  #13
  adc  temp
  ror
  lsr
  lsr
  adc  temp
  ror
  adc  temp
  ror
  lsr
  lsr

The carry is never set or cleared, but its contents are depended on.

Link to comment

I ended up working late so haven't tried it yet. After reading the reciprocal multiplication page I believe the code would be something along these lines:

div5 = (orig * %001100110011) >> 12

if the values are slightly off, because 1/5 is an infinite repeating fraction when represented in binary, then extended the fixed point value such as

div5 = (orig * %0011001100110011) >> 16

In the Correcting the Errors section they mention that using these values would be accurate for all 16 bit dividends:

div5 = (orig * %001100110011001101) >> 18

Since I'm only dividing an 8 bit value, I believe the 12 or 16 should meet my needs.

 

What I plan to do is add a division test to the "menu" which tests the values 0-255. I'll use the score to display the test results - if a dividend doesn't divide correctly I'll show it in the score. If 0-255 all succeed then I'll show FFFFFF.

Link to comment

I dropped this in during lunch:

 

#define DIV_BY_5                0xCCCD
#define DIV_BY_5_SHIFT          18

/*
const unsigned char div_5[] = {
    0,  0,  0,  0,  0,
    1,  1,  1,  1,  1,
    2,  2,  2,  2,  2,
...
    36, 36, 36, 36, 36
};
 */

int PlayfieldSpriteCheck(int x, int y)
{
...
//        y5 = div_5[y];
        y5 = (y * DIV_BY_5) >> DIV_BY_5_SHIFT;
...
}

void MoveShots(int count)
{
...
//                        y = div_5[gMissileY[i] + y];
                        y = ((gMissileY[i] + y) * DIV_BY_5) >> DIV_BY_5_SHIFT;
...
}
It appears to work as the correct wall segments get shot out in the Frenzy variation and the humanoid still correctly collides with the walls. Before the change the ARM code was 11272 bytes, after it's 11100 for a 172 byte savings!

 

I still plan to do the div check before committing this change. I'll do that when I next work on Frantic, possibly tonight but most likely over the weekend.

Link to comment
#define DIV_BY_5                0xCCCD
#define DIV_BY_5_SHIFT          18

        y5 = (y * DIV_BY_5) >> DIV_BY_5_SHIFT;

 

I took a quick peek tonight, and I believe you can get by with this (good for values 0 to 259).

#define DIV_BY_5                0x33
#define DIV_BY_5_SHIFT          8

        y5 = ((y + 1) * DIV_BY_5) >> DIV_BY_5_SHIFT;

.

.

I did this by looking at the values in excel, and I saw that they were off by 1 when the transition to the next integer result occurred. Initially increasing the dividend by 1 fixes this.

 

I am curious how the compiler is computing the multiplication. I believe the arm chip in the harmony has no hardware multiplier. This is what I came up with in pseudo code:

    inc       ; +1
    sta temp
    asl       ; x2
    adc temp  ; x3
    sta temp
    asl       ; x6
    asl       ; x12
    asl       ; x24
    asl       ; x48
    adc temp  ; x51
    lsr >> 8
Link to comment

Thinking about is some more, it could be further improved since you are shifting left four times, and then right eight. So you might do something like this instead:

    inc       ; +1
    sta temp
    asl       ; x2
    adc temp  ; x3
    sta temp
    lsr >> 4
    adc temp
    lsr >> 4


or:

   temp = ((Y + 1)*3)
   result = ((temp / 16) + temp) / 16
Link to comment

I tested the 0xCCCD value I'd dropped in from the page and it worked as expected. When I tried the 0x33_ values it didn't work. I didn't wish to spend any more time on it, so I decided to leave it with the 0xCCCD value.

 

Based on this section of the listing, at 3320, I'd say the ARM has hardware multiply:

2341:main.c        ****     if (y < 36*5)
2342:main.c        ****         y5 = (y * DIV_BY_5) >> DIV_BY_5_SHIFT;
2343:main.c        ****     else
2344:main.c        ****         y5 = 36;
 3313              		.loc 1 2344 0
 3314 0004 2424     		mov	r4, #36	@ y5,
2341:main.c        ****     if (y < 36*5)
 3315              		.loc 1 2341 0
 3316 0006 B329     		cmp	r1, #179	@ y,
 3317 0008 02DC     		bgt	.L268	@,
2342:main.c        ****         y5 = (y * DIV_BY_5) >> DIV_BY_5_SHIFT;
 3318              		.loc 1 2342 0
 3319 000a 3D4C     		ldr	r4, .L282	@ tmp197,
 3320 000c 4C43     		mul	r4, r1	@ tmp196, y
 3321 000e A414     		asr	r4, r4, #18	@ y5, tmp196,
 3322              	.LVL220:
 3323              	.L268:

Per wiki:

The ARM supports add, subtract, and multiply instructions. The integer divide instructions are only implemented by ARM cores based on the following ARM architectures...(not what's in the Harmony)

Link to comment

Today I have found better routines for dividing 6,10,12,20,24,26, and 28. I removed the old file, and replaced it Rev 1 which contains the new routines. :)

  • Like 1
Link to comment

Upgraded the file to Rev 2. It has a new divide by 22 routine which is 3 cycles faster then before, and uses the same amount of bytes. :)

  • Like 1
Link to comment
Guest
Add a comment...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

Loading...
  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...