+batari Posted January 21, 2009 Share Posted January 21, 2009 Hello Batari, I tried to improve the code you posted, and I think I've saved 5 cycles, and 1 register (comparing the two used to hold the processer status, vs a second temp that could be in the same location) at a cost of 2 bytes. Though I haven't fully tested the code (and everytime I do that it bites me in the ass!), but I've been trying this for hours now and I'm tired. I think it works. It works. I wrote a quick test program before I posted mine above, and yours passed I thought there was a way to do it without the php/plp instructions. Quote Link to comment Share on other sites More sharing options...
Omegamatrix Posted January 22, 2009 Share Posted January 22, 2009 (edited) Hello again, Batari. I made a second attempt today. Can you please test it with your program also? This one uses no branching and it takes 58 cycles all the time, which is 2-4 cycles faster then my last code. It also saves one byte over my last attempt. Its disadvantages are that it uses an index register. If the user doesn't have a free index register, they might be able to substitute PHA / PLA for TAY / TYA at a cost of adding 3 cycles, depending where the stack pointer is. ;divide by 10 ;takes 58 cycles lda byte ; 3 lsr ; 2 clc ; 2 adc byte ; 3 sta temp1 ; 3 ror ; 2 tay ; 2 and #$80 ; 2 sta temp2 ; 3 tya ; 2 lsr ; 2 lsr ; 2 lsr ; 2 adc temp1 ; 3 ror ; 2 ora temp2 ; 3 lsr ; 2 lsr ; 2 lsr ; 2 adc temp1 ; 3 ror ; 2 ora temp2 ; 3 lsr ; 2 lsr ; 2 lsr ; 2 Edited January 22, 2009 by Omegamatrix Quote Link to comment Share on other sites More sharing options...
+batari Posted January 22, 2009 Share Posted January 22, 2009 Hello again, Batari. I made a second attempt today. Can you please test it with your program also? This one uses no branching and it takes 58 cycles all the time, which is 2-4 cycles faster then my last code. It also saves one byte over my last attempt. Its disadvantages are that it uses an index register. If the user doesn't have a free index register, they might be able to substitute PHA / PLA for TAY / TYA at a cost of adding 3 cycles, depending where the stack pointer is. It works. I found that the clc isn't needed, so there's 2 more cycles shaved off. I tried to use some illegal opcodes but couldn't find any that worked. Quote Link to comment Share on other sites More sharing options...
Thomas Jentzsch Posted January 22, 2009 Share Posted January 22, 2009 What's the task? Dividing an 8 bit number by 10? Quote Link to comment Share on other sites More sharing options...
+batari Posted January 22, 2009 Share Posted January 22, 2009 What's the task? Dividing an 8 bit number by 10? Yep. Quote Link to comment Share on other sites More sharing options...
GroovyBee Posted January 23, 2009 Share Posted January 23, 2009 Another attempt :- ldx #0 ; 2 lda byte ; 3 cmp #200 ; 2 bcc below200 ; 2/3 sbc #200 ; 2 ldx #20 ; 2 jmp below100 ; 3 below200 cmp #100 ; 2 bcc below100 ; 2/3 sbc #100 ; 2 ldx #10 ; 2 below100 cmp #50 ; 2 bcc below50 ; 2/3 sbc #50 ; 2 sta temp ; 3 txa ; 2 ldx temp ; 3 adc #5-1 ; 2 (Carry set) jmp above50 ; 3 below50 sta temp ; 3 txa ; 2 ldx temp ; 3 above50 cpx #10 ; 2 bcc done ; 2/3 adc #0 ; 2 cpx #20 ; 2 bcc done ; 2/3 adc #0 ; 2 cpx #30 ; 2 bcc done ; 2/3 adc #0 ; 2 cpx #40 ; 2 adc #0 ; 2 done Lots of branches and variable execution time. Tested against the Batari original. Quote Link to comment Share on other sites More sharing options...
Omegamatrix Posted January 23, 2009 Share Posted January 23, 2009 Hello, Batari and Thomas. With Batari's findings of not needing CLC we are two cycles faster. I think I found a way to go another two cycles faster with an illegal opcode SBX. ;divide by 10 ;54 cycles ldx #$80 ; 2 lda byte ; 3 lsr ; 2 adc byte ; 3 sta temp1 ; 3 ror ; 2 sbx #0 ; 2 illegal opcode stx temp2 ; 3 lsr ; 2 lsr ; 2 lsr ; 2 adc temp1 ; 3 ror ; 2 ora temp2 ; 2 lsr ; 2 lsr ; 2 lsr ; 2 adc temp1 ; 3 ror ; 2 ora temp2 ; 2 lsr ; 2 lsr ; 2 lsr ; 2 I have no idea if this illegal opcode is stable, but if it is then we are at 54 cycles and only 33 bytes. I hope this tests okay. ~Omega Quote Link to comment Share on other sites More sharing options...
Robert M Posted January 23, 2009 Share Posted January 23, 2009 (edited) Here is a different approach yielding about the same efficiency. ; Divide unsigned value in A by 10. ; - How? Multiply by 26 (%11010), then divide by 256. ; Outputs : ; A = the answer. ; X = 0 ; Y = unchanged. ; origvalue = the original value to be divided. DivideBy10 LDX #0 ; 2 STX temp ; 3 STA origvalue; 3 ASL ; 2 ROL temp ; 5 ADC origvalue; 3 BCC .skip1 ; 2/3 INC temp ; 5 .skip1 ASL ; 2 ROL temp ; 5 ASL ; 2 ROL temp ; 5 ADC origvalue; 3 BCC .skip2 ; 2/3 INC temp ; 5 .skip2 ASL ; 2 LDA temp ; 3 ROL ; 2 ;----- ; 56 cycles worst case ; 48 cycles best case ; 31 bytes Edited January 23, 2009 by Robert M Quote Link to comment Share on other sites More sharing options...
+batari Posted January 23, 2009 Share Posted January 23, 2009 The latest by Omegamatrix works. Robert M's made it to 68, then it failed. But this one is interesting - how does it work? Quote Link to comment Share on other sites More sharing options...
supercat Posted January 23, 2009 Share Posted January 23, 2009 How about a hybrid table/code approach: ; Divide value in acc. by 10 and store the result in Y. X trashed. tay lsr lsr lsr lsr tax tya sec; Could this be eliminated? sbc table1,y ldy table2,x cmp #10 bcc nobump iny cmp #20 bcc nobump iny nobump: ... table1: byte 0,1,3,4,6,8,9,11,12,14,16,17,19,20,22,24 table2: byte 0,10,30,40,60,80,90,110,120,140,160,170,190,200,220,240 Sixteen instructions totaling 36 cycles worst-case and 24+32=56 bytes Quote Link to comment Share on other sites More sharing options...
+batari Posted January 23, 2009 Share Posted January 23, 2009 (edited) Supercat's works, with the following corrections: sbc table2,x ldy table1,x The sec is required. Edited January 23, 2009 by batari Quote Link to comment Share on other sites More sharing options...
Thomas Jentzsch Posted January 23, 2009 Share Posted January 23, 2009 I wonder if the BCD mode could be used for this. Quote Link to comment Share on other sites More sharing options...
Omegamatrix Posted January 24, 2009 Share Posted January 24, 2009 (edited) I think I managed to save another 2 cycles, and 2 bytes by using the illegal opcode SAX. ;divide by 10 ;52 cycles, 31 bytes lda byte ; 3 ldx #$80 ; 2 lsr ; 2 adc byte ; 3 sta temp1 ; 3 ror ; 2 sax temp2 ; 3 illegal opcode lsr ; 2 lsr ; 2 lsr ; 2 adc temp1 ; 3 ror ; 2 ora temp2 ; 2 lsr ; 2 lsr ; 2 lsr ; 2 adc temp1 ; 3 ror ; 2 ora temp2 ; 2 lsr ; 2 lsr ; 2 lsr ; 2 I can't believe how useful some of these illegal ops are in certain situations. Hopefully it is stable. Edited January 24, 2009 by Omegamatrix Quote Link to comment Share on other sites More sharing options...
Robert M Posted January 24, 2009 Share Posted January 24, 2009 Robert M's made it to 68, then it failed. But this one is interesting - how does it work? I am multiplying the value by the fraction 26/256 which is just slightly larger than 0.1. Quote Link to comment Share on other sites More sharing options...
+batari Posted January 24, 2009 Share Posted January 24, 2009 Robert M's made it to 68, then it failed. But this one is interesting - how does it work? I am multiplying the value by the fraction 26/256 which is just slightly larger than 0.1. Makes sense. It fails at 69 because 69*26/256=7.0078125 If you adjust the original value with value=value-value/64, it may fix it. Quote Link to comment Share on other sites More sharing options...
Omegamatrix Posted January 24, 2009 Share Posted January 24, 2009 Batari, can you please test this one if you have time? I was going through distella's debugger with it, and I think I adjusted it right, but I need confirmation. Thanks if you can, Omega ;divide by 10 ;47 cycles, 28 bytes tax ; 2 sta temp ; 3 lsr ; 2 adc temp ; 3 sta temp ; 3 ror ; 2 lsr ; 2 lsr ; 2 lsr ; 2 adc temp ; 3 cpx #161 ; 2 ror ; 2 lsr ; 2 lsr ; 2 lsr ; 2 adc temp ; 3 cpx #160 ; 2 ror ; 2 lsr ; 2 lsr ; 2 lsr ; 2 Quote Link to comment Share on other sites More sharing options...
Robert M Posted January 24, 2009 Share Posted January 24, 2009 (edited) Not as efficient anymore, but I changed it to multiply by the fraction (25.5/256) which is closer to 0.1, ; Divide unsigned value in A by 10. "floor(A/10)" ; - How? ; - Multiply by 25.5 (%11001.1), then divide by 256 ; Inputs : ; A = 0-255 value to be divided by ten rounded down. ; Outputs : ; A = the answer. ; X = 0 ; Y = unchanged. ; origvalue = the original value in A divided by 2. DivideBy10 LDX #0 ; 2; This is a setup work that can STX temp ; 3; be optimized out if the value STA origvalue ; 3 ; to divide is already in origvalue ; and there is a temp location ; already set to zero. ASL ; 2; temp:A = origvalue * 2 ROL temp ; 5 ADC origvalue ; 3; temp:A = origvalue * 3 BCC .skip1 ; 2/3 INC temp ; 5 .skip1 ASL ; 2; temp:A = origvalue * 6 ROL temp ; 5 ASL ; 2; temp:A = origvalue * 12 ROL temp ; 5 ASL ; 2 ; temp:A = origvalue * 24 ROL temp ; 5 ADC origvalue ; 3; temp:A = origvalue * 25 BCC .skip2 ; 2/3 INC temp ; 5 .skip2 LSR origvalue ; 5 ; temp:A = origvalue * 25.5 ADC origvalue ; 3 LDA temp ; 3 ADC #0 ; 2 ; A = (origvalue * 25.5) / 256 ;----- ; 69 cycles worst case ; 61 cycles best case ; 38 bytes I am curious now to try and make a general solution that can multiply any byte by a fraction. Edited January 24, 2009 by Robert M Quote Link to comment Share on other sites More sharing options...
+batari Posted January 24, 2009 Share Posted January 24, 2009 (edited) Omegamatrix's works, but Robert M's fails at 10 (because 25.5*10/256=0.996.) Robert M's original code will work if it's adjusted with origvalue=origvalue-origvalue/64. It's doing (A-A/64)*26/256=A*25.59375/256, which is pretty close. Unfortunately, I can't think of a better way to do it than this (add at the beginning of the code: sta origvalue rol rol rol and #3 sta temp lda origvalue sec sbc temp sta origvalue EDIT: Corrected analysis (it's not exact, but works.) Edited January 24, 2009 by batari Quote Link to comment Share on other sites More sharing options...
Omegamatrix Posted February 7, 2009 Share Posted February 7, 2009 (edited) I took another stab at this today, and managed to save 11 cycles and 7 bytes over my last effort. ;divide by 10 ;36 cycles, 21 bytes tax ; 2 sta temp ; 3 lsr ; 2 sec ; 2 adc temp ; 3 sta temp ; 3 ror ; 2 lsr ; 2 lsr ; 2 lsr ; 2 adc temp ; 3 cpx #160 ; 2 ror ; 2 lsr ; 2 lsr ; 2 lsr ; 2 Edit: found a way to save another byte by setting the carry! Edited February 7, 2009 by Omegamatrix Quote Link to comment Share on other sites More sharing options...
Omegamatrix Posted February 8, 2009 Share Posted February 8, 2009 Found a way to save another cycle, and preserve the X and Y registers at the same time. Same amount of bytes as before. ;divide by 10 ;35 cycles, 21 bytes sta temp ; 3 ora #$03 ; 2 lsr ; 2 adc temp ; 3 ror ; 2 sta temp ; 3 lsr ; 2 lsr ; 2 lsr ; 2 adc temp ; 3 adc temp ; 3 ror ; 2 lsr ; 2 lsr ; 2 lsr ; 2 Quote Link to comment Share on other sites More sharing options...
+batari Posted February 8, 2009 Share Posted February 8, 2009 Found a way to save another cycle, and preserve the X and Y registers at the same time. Same amount of bytes as before. ;divide by 10 ;35 cycles, 21 bytes sta temp ; 3 ora #$03 ; 2 lsr ; 2 adc temp ; 3 ror ; 2 sta temp ; 3 lsr ; 2 lsr ; 2 lsr ; 2 adc temp ; 3 adc temp ; 3 ror ; 2 lsr ; 2 lsr ; 2 lsr ; 2 Now that's impressive Works, too. Quote Link to comment Share on other sites More sharing options...
+grafixbmp Posted February 8, 2009 Share Posted February 8, 2009 Found a way to save another cycle, and preserve the X and Y registers at the same time. Same amount of bytes as before. ;divide by 10 ;35 cycles, 21 bytes sta temp; 3 ora #$03; 2 lsr ; 2 adc temp; 3 ror ; 2 sta temp; 3 lsr ; 2 lsr ; 2 lsr ; 2 adc temp; 3 adc temp; 3 ror ; 2 lsr ; 2 lsr ; 2 lsr ; 2 Now that's impressive Works, too. Wonder if you could impliment a similar process for other numbers too. (still preserving the x and y) Quote Link to comment Share on other sites More sharing options...
Omegamatrix Posted February 10, 2009 Share Posted February 10, 2009 Managed to shave another byte off. Same cycles as before. ;divide by 10 ;35 cycles, 20 bytes lsr ; 2 sta temp ; 3 lsr ; 2 sec ; 2 adc temp ; 3 sta temp ; 3 lsr ; 2 lsr ; 2 lsr ; 2 adc temp ; 3 adc temp ; 3 ror ; 2 lsr ; 2 lsr ; 2 lsr ; 2 Quote Link to comment Share on other sites More sharing options...
+grafixbmp Posted February 11, 2009 Share Posted February 11, 2009 (edited) There is also another process a friend of mine introduced me to once, but can't quite remember what it was called. It was either pure hex or true hex or something to that effect. It might be more adept at being used on a 6502 since it deals with hex values. But while I collect some more info on the subject, here is a part of the processes used in vedic for multiplication. It may not work well for a processor but it is still nice to know. It deals with the fact that the math is done with base 10 and uses it as so... Suppose you need 8 x 7 8 is 2 below 10 and 7 is 3 below 10. Think of it like this: The answer is 56. The diagram below shows how you get it. You subtract crosswise 8-3 or 7 - 2 to get 5, the first figure of the answer. And you multiply vertically: 2 x 3 to get 6, the last figure of the answer. I'll get back after I see what I can find on the hex stuff. I found what I was looking for. it is Prime HEX or cuban primes. here is a link for it and what it is used for. You can search for more info on the subject and may find some helpful things to use in math on the atari. http://mathworld.wolfram.com/HexNumber.html Here is another site with some visual arrays that look pretty damn cool. http://www.thefoggiestnotion.com/prime_hex_array.htm Edited February 11, 2009 by grafixbmp Quote Link to comment Share on other sites More sharing options...
+grafixbmp Posted February 11, 2009 Share Posted February 11, 2009 (edited) I thought about the topic that started it all, "Fast devide by 7" And decided to take an all encompasing aproach to an idea. If a table array had the math layed out for every possible answer in decimal, bianary and hex, then a search for easy pattern recognition can be done to devise an efficient way to come to the right answer for groups of given senarios. That way you can have more than one way to get the answer depending on what the current value being worked with is. And given that the highest number you could even do with an 8 bit value is 255, then somewhere around up to say 4 diffrent ways could be constructed to obtain an answer that take extremely close to the same amount of cycles to accomplish the task since ROM space really isn't such a big issue. But then comes the decission making process of which method is used to get the answer based upon what the current value is that is being worked with. This part may be too much overhead. Sometimes visualizing the array of answers helps to see patterns where they could not be seen before. Basing this upon patterns in bianary or hex could be the key to optimizing the complete process. Sound good or is this pure smoke/speculation? Edited February 11, 2009 by grafixbmp Quote Link to comment Share on other sites More sharing options...
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.