Jump to content
IGNORED

Fast divide by seven?


brpocock

Recommended Posts

Hello Batari,

 

 

I tried to improve the code you posted, and I think I've saved 5 cycles, and 1 register (comparing the two used to hold the processer status, vs a second temp that could be in the same location) at a cost of 2 bytes. Though I haven't fully tested the code (and everytime I do that it bites me in the ass!), but I've been trying this for hours now and I'm tired. I think it works.

It works. I wrote a quick test program before I posted mine above, and yours passed :)

 

I thought there was a way to do it without the php/plp instructions.

Link to comment
Share on other sites

Hello again, Batari.

 

I made a second attempt today. Can you please test it with your program also? :)

 

 

This one uses no branching and it takes 58 cycles all the time, which is 2-4 cycles faster then my last code. It also saves one byte over my last attempt. Its disadvantages are that it uses an index register. If the user doesn't have a free index register, they might be able to substitute PHA / PLA for TAY / TYA at a cost of adding 3 cycles, depending where the stack pointer is.

 

 

;divide by 10
;takes 58 cycles
lda byte	 ; 3
lsr		  ; 2
clc		  ; 2
adc byte	 ; 3
sta temp1	; 3
ror		  ; 2
tay		  ; 2
and #$80	 ; 2
sta temp2	; 3
tya		  ; 2
lsr		  ; 2
lsr		  ; 2
lsr		  ; 2
adc temp1	; 3
ror		  ; 2
ora temp2	; 3
lsr		  ; 2
lsr		  ; 2
lsr		  ; 2
adc temp1	; 3
ror		  ; 2
ora temp2	; 3
lsr		  ; 2
lsr		  ; 2
lsr		  ; 2

Edited by Omegamatrix
Link to comment
Share on other sites

Hello again, Batari.

 

I made a second attempt today. Can you please test it with your program also? :)

 

 

This one uses no branching and it takes 58 cycles all the time, which is 2-4 cycles faster then my last code. It also saves one byte over my last attempt. Its disadvantages are that it uses an index register. If the user doesn't have a free index register, they might be able to substitute PHA / PLA for TAY / TYA at a cost of adding 3 cycles, depending where the stack pointer is.

It works.

 

I found that the clc isn't needed, so there's 2 more cycles shaved off. I tried to use some illegal opcodes but couldn't find any that worked.

Link to comment
Share on other sites

Another attempt :-

 

		ldx #0		 ; 2
	lda byte	   ; 3
	cmp #200	   ; 2
	bcc below200   ; 2/3
	sbc #200	   ; 2
	ldx #20		; 2
	jmp below100   ; 3
below200
	cmp #100	   ; 2
	bcc below100   ; 2/3
	sbc #100	   ; 2
	ldx #10		; 2
below100
	cmp #50		; 2
	bcc below50	; 2/3
	sbc #50		; 2
	sta temp	   ; 3
	txa			; 2
	ldx temp	   ; 3
	adc #5-1	   ; 2 (Carry set)
	jmp above50	; 3
below50
	sta temp	   ; 3
	txa			; 2
	ldx temp	   ; 3
above50
	cpx #10		; 2
	bcc done	   ; 2/3
	adc #0		 ; 2
	cpx #20		; 2
	bcc done	   ; 2/3
	adc #0		 ; 2
	cpx #30		; 2
	bcc done	   ; 2/3
	adc #0		 ; 2
	cpx #40		; 2
	adc #0		 ; 2
done

 

Lots of branches and variable execution time. Tested against the Batari original.

Link to comment
Share on other sites

Hello, Batari and Thomas.

 

 

With Batari's findings of not needing CLC we are two cycles faster. I think I found a way to go another two cycles faster with an illegal opcode SBX.

 

 

;divide by 10
;54 cycles

ldx	#$80		; 2
lda	byte		; 3
lsr				; 2
adc	byte		; 3
sta	temp1	   ; 3
ror				; 2
sbx	#0		  ; 2   illegal opcode
stx	temp2	   ; 3
lsr				; 2
lsr				; 2
lsr				; 2
adc	temp1	   ; 3
ror				; 2
ora	temp2	   ; 2
lsr				; 2
lsr				; 2
lsr				; 2
adc	temp1	   ; 3
ror				; 2
ora	temp2	   ; 2
lsr				; 2
lsr				; 2
lsr				; 2

 

 

I have no idea if this illegal opcode is stable, but if it is then we are at 54 cycles and only 33 bytes. :) I hope this tests okay.

 

 

~Omega

Link to comment
Share on other sites

Here is a different approach yielding about the same efficiency.

; Divide unsigned value in A by 10.
; - How? Multiply by 26 (%11010), then divide by 256.
; Outputs :
; 	A = the answer.
;	X = 0 
;	Y = unchanged.
;	origvalue = the original value to be divided.
DivideBy10
LDX #0		  ;  2
STX temp		;  3
STA origvalue;  3

ASL			 ;  2
ROL temp		;  5

ADC	origvalue;  3
BCC .skip1	  ;  2/3
INC temp		;  5
.skip1	
ASL			 ;  2
ROL temp		;  5	 

ASL			 ;  2
ROL temp		;  5

ADC origvalue;  3
BCC .skip2		 ;  2/3
INC temp		;  5
.skip2	
ASL			 ;  2
LDA temp		;  3
ROL			 ;  2
				;-----
				; 56 cycles worst case
				; 48 cycles best case
				; 31 bytes

Edited by Robert M
Link to comment
Share on other sites

How about a hybrid table/code approach:

; Divide value in acc. by 10 and store the result in Y.  X trashed.
 tay
 lsr
 lsr
 lsr
 lsr
 tax
 tya
 sec; Could this be eliminated?
 sbc table1,y
 ldy table2,x
 cmp #10
 bcc nobump
 iny
 cmp #20
 bcc nobump
 iny
nobump:
...
table1:
 byte 0,1,3,4,6,8,9,11,12,14,16,17,19,20,22,24
table2:
 byte 0,10,30,40,60,80,90,110,120,140,160,170,190,200,220,240

Sixteen instructions totaling 36 cycles worst-case and 24+32=56 bytes

Link to comment
Share on other sites

I think I managed to save another 2 cycles, and 2 bytes by using the illegal opcode SAX.

 

 

;divide by 10
;52 cycles, 31 bytes

lda  byte	  ; 3
ldx  #$80	  ; 2
lsr			; 2
adc  byte	  ; 3
sta  temp1	 ; 3
ror			; 2
sax  temp2	 ; 3   illegal opcode
lsr			; 2
lsr			; 2
lsr			; 2
adc  temp1	 ; 3
ror			; 2
ora  temp2	 ; 2
lsr			; 2
lsr			; 2
lsr			; 2
adc  temp1	 ; 3
ror			; 2
ora  temp2	 ; 2
lsr			; 2
lsr			; 2
lsr			; 2

 

I can't believe how useful some of these illegal ops are in certain situations. Hopefully it is stable. ;)

Edited by Omegamatrix
Link to comment
Share on other sites

Robert M's made it to 68, then it failed. But this one is interesting - how does it work?

 

 

I am multiplying the value by the fraction 26/256 which is just slightly larger than 0.1.

Makes sense. It fails at 69 because 69*26/256=7.0078125

 

If you adjust the original value with value=value-value/64, it may fix it.

Link to comment
Share on other sites

Batari, can you please test this one if you have time? I was going through distella's debugger with it, and I think I adjusted it right, but I need confirmation.

 

 

Thanks if you can,

Omega

 

 

;divide by 10
;47 cycles, 28 bytes

tax				; 2
sta	temp		; 3
lsr				; 2
adc	temp		; 3
sta	temp		; 3
ror				; 2
lsr				; 2
lsr				; 2
lsr				; 2
adc	temp		; 3
cpx	#161		; 2
ror				; 2
lsr				; 2
lsr				; 2
lsr				; 2
adc	temp		; 3
cpx	#160		; 2
ror				; 2
lsr				; 2
lsr				; 2
lsr				; 2

Link to comment
Share on other sites

Not as efficient anymore, but I changed it to multiply by the fraction (25.5/256) which is closer to 0.1,

; Divide unsigned value in A by 10.  "floor(A/10)"
; - How?
; - Multiply by 25.5 (%11001.1), then divide by 256  
; Inputs :
;	A = 0-255 value to be divided by ten rounded down.
; Outputs :
;    A = the answer.
;    X = 0 
;    Y = unchanged.
;    origvalue = the original value in A divided by 2.
DivideBy10
    LDX #0          ;  2; This is a setup work that can
    STX temp        ;  3; be optimized out if the value
    STA origvalue    ;  3 ; to divide is already in origvalue
				; and there is a temp location 
				; already set to zero.

    ASL              ;  2; temp:A = origvalue * 2 
    ROL temp        ;  5

    ADC origvalue    ;  3; temp:A = origvalue * 3
    BCC .skip1      ;  2/3
    INC temp        ;  5
.skip1    

    ASL              ;  2; temp:A = origvalue * 6
    ROL temp        ;  5     
    
    ASL              ;  2; temp:A = origvalue * 12
    ROL temp        ;  5
    
ASL			;  2 ; temp:A = origvalue * 24
ROL temp	;  5

    ADC origvalue    ;  3; temp:A = origvalue * 25
    BCC .skip2       ;  2/3
    INC temp        ;  5
.skip2    

LSR origvalue  ;  5 ; temp:A = origvalue * 25.5	
ADC origvalue  ;  3		  
    LDA temp	;  3
ADC #0		 ;  2 ; A = (origvalue * 25.5) / 256
                    ;-----
                    ; 69 cycles worst case
                    ; 61 cycles best case
                    ; 38 bytes

 

I am curious now to try and make a general solution that can multiply any byte by a fraction.

Edited by Robert M
Link to comment
Share on other sites

Omegamatrix's works, but Robert M's fails at 10 (because 25.5*10/256=0.996.)

 

Robert M's original code will work if it's adjusted with origvalue=origvalue-origvalue/64.

 

It's doing (A-A/64)*26/256=A*25.59375/256, which is pretty close.

 

Unfortunately, I can't think of a better way to do it than this (add at the beginning of the code:

  sta origvalue
  rol
  rol
  rol
and #3
  sta temp
 lda origvalue
  sec
 sbc temp
 sta origvalue

EDIT: Corrected analysis (it's not exact, but works.)

Edited by batari
Link to comment
Share on other sites

  • 2 weeks later...

I took another stab at this today, and managed to save 11 cycles and 7 bytes over my last effort.

 

 

;divide by 10
;36 cycles, 21 bytes

tax				; 2
sta	temp		; 3
lsr				; 2
sec				; 2
adc	temp		; 3
sta	temp		; 3
ror				; 2
lsr				; 2
lsr				; 2
lsr				; 2
adc	temp		; 3
cpx	#160		; 2
ror				; 2
lsr				; 2
lsr				; 2
lsr				; 2

 

 

Edit: found a way to save another byte by setting the carry!

Edited by Omegamatrix
Link to comment
Share on other sites

Found a way to save another cycle, and preserve the X and Y registers at the same time. :) Same amount of bytes as before.

 

 

;divide by 10
;35 cycles, 21 bytes

sta	temp		; 3
ora	#$03		; 2
lsr				; 2
adc	temp		; 3
ror				; 2
sta	temp		; 3
lsr				; 2
lsr				; 2
lsr				; 2
adc	temp		; 3
adc	temp		; 3
ror				; 2
lsr				; 2
lsr				; 2
lsr				; 2

Link to comment
Share on other sites

Found a way to save another cycle, and preserve the X and Y registers at the same time. :) Same amount of bytes as before.

 

 

;divide by 10
;35 cycles, 21 bytes

sta	temp	; 3
ora	#$03	; 2
lsr			; 2
adc	temp	; 3
ror			; 2
sta	temp	; 3
lsr			; 2
lsr			; 2
lsr			; 2
adc	temp	; 3
adc	temp	; 3
ror			; 2
lsr			; 2
lsr			; 2
lsr			; 2

Now that's impressive :) Works, too.

Link to comment
Share on other sites

Found a way to save another cycle, and preserve the X and Y registers at the same time. :) Same amount of bytes as before.

 

 

;divide by 10
;35 cycles, 21 bytes

sta	temp; 3
ora	#$03; 2
lsr		; 2
adc	temp; 3
ror		; 2
sta	temp; 3
lsr		; 2
lsr		; 2
lsr		; 2
adc	temp; 3
adc	temp; 3
ror		; 2
lsr		; 2
lsr		; 2
lsr		; 2

Now that's impressive :) Works, too.

Wonder if you could impliment a similar process for other numbers too. (still preserving the x and y)
Link to comment
Share on other sites

There is also another process a friend of mine introduced me to once, but can't quite remember what it was called. It was either pure hex or true hex or something to that effect. It might be more adept at being used on a 6502 since it deals with hex values. But while I collect some more info on the subject, here is a part of the processes used in vedic for multiplication. It may not work well for a processor but it is still nice to know. It deals with the fact that the math is done with base 10 and uses it as so...

 

Suppose you need 8 x 7

 

8 is 2 below 10 and 7 is 3 below 10.

Think of it like this:

 

The answer is 56.

The diagram below shows how you get it.

 

You subtract crosswise 8-3 or 7 - 2 to get 5,

the first figure of the answer.

And you multiply vertically: 2 x 3 to get 6,

the last figure of the answer.

 

 

I'll get back after I see what I can find on the hex stuff.

I found what I was looking for. it is Prime HEX or cuban primes. here is a link for it and what it is used for. You can search for more info on the subject and may find some helpful things to use in math on the atari.

http://mathworld.wolfram.com/HexNumber.html

 

Here is another site with some visual arrays that look pretty damn cool.

http://www.thefoggiestnotion.com/prime_hex_array.htm

Edited by grafixbmp
Link to comment
Share on other sites

I thought about the topic that started it all, "Fast devide by 7" And decided to take an all encompasing aproach to an idea.

 

If a table array had the math layed out for every possible answer in decimal, bianary and hex, then a search for easy pattern recognition can be done to devise an efficient way to come to the right answer for groups of given senarios. That way you can have more than one way to get the answer depending on what the current value being worked with is. And given that the highest number you could even do with an 8 bit value is 255, then somewhere around up to say 4 diffrent ways could be constructed to obtain an answer that take extremely close to the same amount of cycles to accomplish the task since ROM space really isn't such a big issue. But then comes the decission making process of which method is used to get the answer based upon what the current value is that is being worked with. This part may be too much overhead. Sometimes visualizing the array of answers helps to see patterns where they could not be seen before. Basing this upon patterns in bianary or hex could be the key to optimizing the complete process. Sound good or is this pure smoke/speculation?

Edited by grafixbmp
Link to comment
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

Loading...
  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...