Fast divide by seven?

+batari · January 21, 2009

Hello Batari,

I tried to improve the code you posted, and I think I've saved 5 cycles, and 1 register (comparing the two used to hold the processer status, vs a second temp that could be in the same location) at a cost of 2 bytes. Though I haven't fully tested the code (and everytime I do that it bites me in the ass!), but I've been trying this for hours now and I'm tired. I think it works.

It works. I wrote a quick test program before I posted mine above, and yours passed

I thought there was a way to do it without the php/plp instructions.

Omegamatrix · January 22, 2009

Hello again, Batari.

I made a second attempt today. Can you please test it with your program also?

This one uses no branching and it takes 58 cycles all the time, which is 2-4 cycles faster then my last code. It also saves one byte over my last attempt. Its disadvantages are that it uses an index register. If the user doesn't have a free index register, they might be able to substitute PHA / PLA for TAY / TYA at a cost of adding 3 cycles, depending where the stack pointer is.

;divide by 10
;takes 58 cycles
lda byte	 ; 3
lsr		  ; 2
clc		  ; 2
adc byte	 ; 3
sta temp1	; 3
ror		  ; 2
tay		  ; 2
and #$80	 ; 2
sta temp2	; 3
tya		  ; 2
lsr		  ; 2
lsr		  ; 2
lsr		  ; 2
adc temp1	; 3
ror		  ; 2
ora temp2	; 3
lsr		  ; 2
lsr		  ; 2
lsr		  ; 2
adc temp1	; 3
ror		  ; 2
ora temp2	; 3
lsr		  ; 2
lsr		  ; 2
lsr		  ; 2

Edited January 22, 2009 by Omegamatrix

+batari · January 22, 2009

Hello again, Batari.

I made a second attempt today. Can you please test it with your program also?

This one uses no branching and it takes 58 cycles all the time, which is 2-4 cycles faster then my last code. It also saves one byte over my last attempt. Its disadvantages are that it uses an index register. If the user doesn't have a free index register, they might be able to substitute PHA / PLA for TAY / TYA at a cost of adding 3 cycles, depending where the stack pointer is.

It works.

I found that the clc isn't needed, so there's 2 more cycles shaved off. I tried to use some illegal opcodes but couldn't find any that worked.

Thomas Jentzsch · January 22, 2009

What's the task? Dividing an 8 bit number by 10?

+batari · January 22, 2009

What's the task? Dividing an 8 bit number by 10?

Yep.

GroovyBee · January 23, 2009

Another attempt :-

		ldx #0		 ; 2
	lda byte	   ; 3
	cmp #200	   ; 2
	bcc below200   ; 2/3
	sbc #200	   ; 2
	ldx #20		; 2
	jmp below100   ; 3
below200
	cmp #100	   ; 2
	bcc below100   ; 2/3
	sbc #100	   ; 2
	ldx #10		; 2
below100
	cmp #50		; 2
	bcc below50	; 2/3
	sbc #50		; 2
	sta temp	   ; 3
	txa			; 2
	ldx temp	   ; 3
	adc #5-1	   ; 2 (Carry set)
	jmp above50	; 3
below50
	sta temp	   ; 3
	txa			; 2
	ldx temp	   ; 3
above50
	cpx #10		; 2
	bcc done	   ; 2/3
	adc #0		 ; 2
	cpx #20		; 2
	bcc done	   ; 2/3
	adc #0		 ; 2
	cpx #30		; 2
	bcc done	   ; 2/3
	adc #0		 ; 2
	cpx #40		; 2
	adc #0		 ; 2
done

Lots of branches and variable execution time. Tested against the Batari original.

Omegamatrix · January 23, 2009

Hello, Batari and Thomas.

With Batari's findings of not needing CLC we are two cycles faster. I think I found a way to go another two cycles faster with an illegal opcode SBX.

;divide by 10
;54 cycles

ldx	#$80		; 2
lda	byte		; 3
lsr				; 2
adc	byte		; 3
sta	temp1	   ; 3
ror				; 2
sbx	#0		  ; 2   illegal opcode
stx	temp2	   ; 3
lsr				; 2
lsr				; 2
lsr				; 2
adc	temp1	   ; 3
ror				; 2
ora	temp2	   ; 2
lsr				; 2
lsr				; 2
lsr				; 2
adc	temp1	   ; 3
ror				; 2
ora	temp2	   ; 2
lsr				; 2
lsr				; 2
lsr				; 2

I have no idea if this illegal opcode is stable, but if it is then we are at 54 cycles and only 33 bytes. I hope this tests okay.

~Omega

Robert M · January 23, 2009

Here is a different approach yielding about the same efficiency.

; Divide unsigned value in A by 10.
; - How? Multiply by 26 (%11010), then divide by 256.
; Outputs :
; 	A = the answer.
;	X = 0 
;	Y = unchanged.
;	origvalue = the original value to be divided.
DivideBy10
LDX #0		  ;  2
STX temp		;  3
STA origvalue;  3

ASL			 ;  2
ROL temp		;  5

ADC	origvalue;  3
BCC .skip1	  ;  2/3
INC temp		;  5
.skip1	
ASL			 ;  2
ROL temp		;  5	 

ASL			 ;  2
ROL temp		;  5

ADC origvalue;  3
BCC .skip2		 ;  2/3
INC temp		;  5
.skip2	
ASL			 ;  2
LDA temp		;  3
ROL			 ;  2
				;-----
				; 56 cycles worst case
				; 48 cycles best case
				; 31 bytes

Edited January 23, 2009 by Robert M

+batari · January 23, 2009

The latest by Omegamatrix works. Robert M's made it to 68, then it failed. But this one is interesting - how does it work?

supercat · January 23, 2009

How about a hybrid table/code approach:

; Divide value in acc. by 10 and store the result in Y.  X trashed.
 tay
 lsr
 lsr
 lsr
 lsr
 tax
 tya
 sec; Could this be eliminated?
 sbc table1,y
 ldy table2,x
 cmp #10
 bcc nobump
 iny
 cmp #20
 bcc nobump
 iny
nobump:
...
table1:
 byte 0,1,3,4,6,8,9,11,12,14,16,17,19,20,22,24
table2:
 byte 0,10,30,40,60,80,90,110,120,140,160,170,190,200,220,240

Sixteen instructions totaling 36 cycles worst-case and 24+32=56 bytes

+batari · January 23, 2009

Supercat's works, with the following corrections:

  sbc table2,x
 ldy table1,x

The sec is required.

Edited January 23, 2009 by batari

Thomas Jentzsch · January 23, 2009

I wonder if the BCD mode could be used for this. :ponder:

Omegamatrix · January 24, 2009

I think I managed to save another 2 cycles, and 2 bytes by using the illegal opcode SAX.

;divide by 10
;52 cycles, 31 bytes

lda  byte	  ; 3
ldx  #$80	  ; 2
lsr			; 2
adc  byte	  ; 3
sta  temp1	 ; 3
ror			; 2
sax  temp2	 ; 3   illegal opcode
lsr			; 2
lsr			; 2
lsr			; 2
adc  temp1	 ; 3
ror			; 2
ora  temp2	 ; 2
lsr			; 2
lsr			; 2
lsr			; 2
adc  temp1	 ; 3
ror			; 2
ora  temp2	 ; 2
lsr			; 2
lsr			; 2
lsr			; 2

I can't believe how useful some of these illegal ops are in certain situations. Hopefully it is stable.

Edited January 24, 2009 by Omegamatrix

Robert M · January 24, 2009

Robert M's made it to 68, then it failed. But this one is interesting - how does it work?

I am multiplying the value by the fraction 26/256 which is just slightly larger than 0.1.

+batari · January 24, 2009

Robert M's made it to 68, then it failed. But this one is interesting - how does it work?

I am multiplying the value by the fraction 26/256 which is just slightly larger than 0.1.

Makes sense. It fails at 69 because 69*26/256=7.0078125

If you adjust the original value with value=value-value/64, it may fix it.

Omegamatrix · January 24, 2009

Batari, can you please test this one if you have time? I was going through distella's debugger with it, and I think I adjusted it right, but I need confirmation.

Thanks if you can,

Omega

;divide by 10
;47 cycles, 28 bytes

tax				; 2
sta	temp		; 3
lsr				; 2
adc	temp		; 3
sta	temp		; 3
ror				; 2
lsr				; 2
lsr				; 2
lsr				; 2
adc	temp		; 3
cpx	#161		; 2
ror				; 2
lsr				; 2
lsr				; 2
lsr				; 2
adc	temp		; 3
cpx	#160		; 2
ror				; 2
lsr				; 2
lsr				; 2
lsr				; 2

Robert M · January 24, 2009

Not as efficient anymore, but I changed it to multiply by the fraction (25.5/256) which is closer to 0.1,

; Divide unsigned value in A by 10.  "floor(A/10)"
; - How?
; - Multiply by 25.5 (%11001.1), then divide by 256  
; Inputs :
;	A = 0-255 value to be divided by ten rounded down.
; Outputs :
;    A = the answer.
;    X = 0 
;    Y = unchanged.
;    origvalue = the original value in A divided by 2.
DivideBy10
    LDX #0          ;  2; This is a setup work that can
    STX temp        ;  3; be optimized out if the value
    STA origvalue    ;  3 ; to divide is already in origvalue
				; and there is a temp location 
				; already set to zero.

    ASL              ;  2; temp:A = origvalue * 2 
    ROL temp        ;  5

    ADC origvalue    ;  3; temp:A = origvalue * 3
    BCC .skip1      ;  2/3
    INC temp        ;  5
.skip1    

    ASL              ;  2; temp:A = origvalue * 6
    ROL temp        ;  5     
    
    ASL              ;  2; temp:A = origvalue * 12
    ROL temp        ;  5
    
ASL			;  2 ; temp:A = origvalue * 24
ROL temp	;  5

    ADC origvalue    ;  3; temp:A = origvalue * 25
    BCC .skip2       ;  2/3
    INC temp        ;  5
.skip2    

LSR origvalue  ;  5 ; temp:A = origvalue * 25.5	
ADC origvalue  ;  3		  
    LDA temp	;  3
ADC #0		 ;  2 ; A = (origvalue * 25.5) / 256
                    ;-----
                    ; 69 cycles worst case
                    ; 61 cycles best case
                    ; 38 bytes

I am curious now to try and make a general solution that can multiply any byte by a fraction.

Edited January 24, 2009 by Robert M

+batari · January 24, 2009

Omegamatrix's works, but Robert M's fails at 10 (because 25.5*10/256=0.996.)

Robert M's original code will work if it's adjusted with origvalue=origvalue-origvalue/64.

It's doing (A-A/64)*26/256=A*25.59375/256, which is pretty close.

Unfortunately, I can't think of a better way to do it than this (add at the beginning of the code:

  sta origvalue
  rol
  rol
  rol
and #3
  sta temp
 lda origvalue
  sec
 sbc temp
 sta origvalue

EDIT: Corrected analysis (it's not exact, but works.)

Edited January 24, 2009 by batari

Omegamatrix · February 7, 2009

I took another stab at this today, and managed to save 11 cycles and 7 bytes over my last effort.

;divide by 10
;36 cycles, 21 bytes

tax				; 2
sta	temp		; 3
lsr				; 2
sec				; 2
adc	temp		; 3
sta	temp		; 3
ror				; 2
lsr				; 2
lsr				; 2
lsr				; 2
adc	temp		; 3
cpx	#160		; 2
ror				; 2
lsr				; 2
lsr				; 2
lsr				; 2

Edit: found a way to save another byte by setting the carry!

Edited February 7, 2009 by Omegamatrix

Omegamatrix · February 8, 2009

Found a way to save another cycle, and preserve the X and Y registers at the same time. Same amount of bytes as before.

;divide by 10
;35 cycles, 21 bytes

sta	temp		; 3
ora	#$03		; 2
lsr				; 2
adc	temp		; 3
ror				; 2
sta	temp		; 3
lsr				; 2
lsr				; 2
lsr				; 2
adc	temp		; 3
adc	temp		; 3
ror				; 2
lsr				; 2
lsr				; 2
lsr				; 2

+batari · February 8, 2009

Found a way to save another cycle, and preserve the X and Y registers at the same time. Same amount of bytes as before.
;divide by 10
;35 cycles, 21 bytes

sta	temp	; 3
ora	#$03	; 2
lsr			; 2
adc	temp	; 3
ror			; 2
sta	temp	; 3
lsr			; 2
lsr			; 2
lsr			; 2
adc	temp	; 3
adc	temp	; 3
ror			; 2
lsr			; 2
lsr			; 2
lsr			; 2

Now that's impressive Works, too.

+grafixbmp · February 8, 2009

Found a way to save another cycle, and preserve the X and Y registers at the same time. Same amount of bytes as before.
;divide by 10
;35 cycles, 21 bytes

sta	temp; 3
ora	#$03; 2
lsr		; 2
adc	temp; 3
ror		; 2
sta	temp; 3
lsr		; 2
lsr		; 2
lsr		; 2
adc	temp; 3
adc	temp; 3
ror		; 2
lsr		; 2
lsr		; 2
lsr		; 2
Now that's impressive Works, too.

Wonder if you could impliment a similar process for other numbers too. (still preserving the x and y)

Omegamatrix · February 10, 2009

Managed to shave another byte off. Same cycles as before.

;divide by 10
;35 cycles, 20 bytes

lsr				; 2
sta	temp		; 3
lsr				; 2
sec				; 2
adc	temp		; 3
sta	temp		; 3
lsr				; 2
lsr				; 2
lsr				; 2
adc	temp		; 3
adc	temp		; 3
ror				; 2
lsr				; 2
lsr				; 2
lsr				; 2

+grafixbmp · February 11, 2009

There is also another process a friend of mine introduced me to once, but can't quite remember what it was called. It was either pure hex or true hex or something to that effect. It might be more adept at being used on a 6502 since it deals with hex values. But while I collect some more info on the subject, here is a part of the processes used in vedic for multiplication. It may not work well for a processor but it is still nice to know. It deals with the fact that the math is done with base 10 and uses it as so...

Suppose you need 8 x 7

8 is 2 below 10 and 7 is 3 below 10.

Think of it like this:

The answer is 56.

The diagram below shows how you get it.

You subtract crosswise 8-3 or 7 - 2 to get 5,

the first figure of the answer.

And you multiply vertically: 2 x 3 to get 6,

the last figure of the answer.

I'll get back after I see what I can find on the hex stuff.

I found what I was looking for. it is Prime HEX or cuban primes. here is a link for it and what it is used for. You can search for more info on the subject and may find some helpful things to use in math on the atari.

http://mathworld.wolfram.com/HexNumber.html

Here is another site with some visual arrays that look pretty damn cool.

http://www.thefoggiestnotion.com/prime_hex_array.htm

Edited February 11, 2009 by grafixbmp

+grafixbmp · February 11, 2009

I thought about the topic that started it all, "Fast devide by 7" And decided to take an all encompasing aproach to an idea.

If a table array had the math layed out for every possible answer in decimal, bianary and hex, then a search for easy pattern recognition can be done to devise an efficient way to come to the right answer for groups of given senarios. That way you can have more than one way to get the answer depending on what the current value being worked with is. And given that the highest number you could even do with an 8 bit value is 255, then somewhere around up to say 4 diffrent ways could be constructed to obtain an answer that take extremely close to the same amount of cycles to accomplish the task since ROM space really isn't such a big issue. But then comes the decission making process of which method is used to get the answer based upon what the current value is that is being worked with. This part may be too much overhead. Sometimes visualizing the array of answers helps to see patterns where they could not be seen before. Basing this upon patterns in bianary or hex could be the key to optimizing the complete process. Sound good or is this pure smoke/speculation?

Edited February 11, 2009 by grafixbmp

Fast divide by seven?

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation

Recently Browsing 0 members