danwinslow Posted January 22, 2011 Share Posted January 22, 2011 Looking for asm examples of a fast general purpose memory copy. Thanks Quote Link to comment Share on other sites More sharing options...
Rybags Posted January 22, 2011 Share Posted January 22, 2011 Depends what you're doing. If it's a fixed move where the locations don't change then unrolled loops with absolute indexed will be fast. For a random move, a self-modified loop with absolute addressing can be fast. Other criteria include - is it always a multiple of 256 bytes being moved, or can it be any amount. Quote Link to comment Share on other sites More sharing options...
danwinslow Posted January 22, 2011 Author Share Posted January 22, 2011 General purpose, so no guaranteed boundaries, no set amount, and can't be unrolled. Not sure how self modify would be used, or why it would be faster than say just a zp pointer set. Could you show a small example? Quote Link to comment Share on other sites More sharing options...
Nukey Shay Posted January 22, 2011 Share Posted January 22, 2011 2 .word pointers...SOURCE and TARGET. X and Y hold the MSB and LSB of the # of bytes to move. If the routine is in RAM memory, you can just use absolute,Y indexing and adjust the source and target addresses when the loop rolls to zero (adjusting the MSB of both, bumping X, and reenter the loop if no rollover occurs). If the routine is in ROM, set up the pointers in zeropage space and use indirect(,Y) addressing. Keep in mind that since the routine is generic, you'd need 2 versions. Once that moves memory from the top down, and one that moves from the bottom up (to avoid any possible overlapping). A 2-byte comparison between SOURCE and TARGET at the beginning can be used to decide which one to branch to. Quote Link to comment Share on other sites More sharing options...
Rybags Posted January 22, 2011 Share Posted January 22, 2011 lda abs,x = 4 cycles or 5 if a page boundary crossed lda (zp),y = 5 cycles or 6 if page boundary crossed So there's savings to be had there. Generally you'd want your parameters to be SOURCE, DEST, and LENGTH. In this case, LENGTH is # of bytes -1. So, using indirect in a loop... ldy #0 ldx length+1 ; use X as page counter beq last ; if moving less than 256 bytes, skip main loop moveloop lda (source),y sta (dest),y ; hint - you can use multiple pairs of lda/sta here to save time, use one INY for each, number of lda/sta pairs must be a factor of 256 iny bne moveloop inc source inc dest dex bne moveloop last ; this bit moves the last part lda (source),y sta (dest),x cpy length beq finished iny bne last That should work - was off the top of my head. Obviously there can be some more optimisation without using too much more program space. The easiest speed boost is to just unroll that top loop some. 5 bytes of extra program reduces the number of branch occurrences, so a pair of LDA/STA would save about 256 cycles for every 256 bytes you're moving. Quote Link to comment Share on other sites More sharing options...
Creature XL Posted January 24, 2011 Share Posted January 24, 2011 lda abs,x = 4 cycles or 5 if a page boundary crossed lda (zp),y = 5 cycles or 6 if page boundary crossed So there's savings to be had there. Generally you'd want your parameters to be SOURCE, DEST, and LENGTH. In this case, LENGTH is # of bytes -1. So, using indirect in a loop... ldy #0 ldx length+1 ; use X as page counter beq last ; if moving less than 256 bytes, skip main loop moveloop lda (source),y sta (dest),y ; hint - you can use multiple pairs of lda/sta here to save time, use one INY for each, number of lda/sta pairs must be a factor of 256 iny bne moveloop inc source inc dest dex bne moveloop last ; this bit moves the last part lda (source),y sta (dest),x cpy length beq finished iny bne last That should work - was off the top of my head. Obviously there can be some more optimisation without using too much more program space. The easiest speed boost is to just unroll that top loop some. 5 bytes of extra program reduces the number of branch occurrences, so a pair of LDA/STA would save about 256 cycles for every 256 bytes you're moving. At least one typo: (),x I am not sure why the loop checking of the "last" loop is so "complex". Additionally I want to offer another version, which includes SMC. Might be that the OP can't use SMC but others might be reading/finding this thread so I think I offer it. It is the same as Rybags idea, so all the mentioned optimizations he stated apply. The difference is that I would pack this routine in the Zeropage and use SMC: "movei_w" is a macro with makes code more readable. It consist of 2 pairs of LDA/STA as you might have assumed. ; only owrks in Zeropage! fastmove: ldy #0 movei_w src,smcs+1 movei_w dst,smcd+1 ldx len+1 beq last loop: smcs: lda $f000,y ; gets overridden smcd: sta $f000,y ; ditto iny bne loop inc smcs+2 inc smcd+2 dex bne loop last: lda (smcs+1),y sta (smcd+1),y iny cpy #len bne last I hope this is error free. Its not tested! Quote Link to comment Share on other sites More sharing options...
Rybags Posted January 24, 2011 Share Posted January 24, 2011 I wouldn't bother running code in z-page unless every last cycle was critical. Really the only saving there is in the INC of the source/dest pointers and it occurs just once each per page. For the "last page" stuff, you can easily enough just copy the pointers from the top part of the code. Quote Link to comment Share on other sites More sharing options...
danwinslow Posted January 24, 2011 Author Share Posted January 24, 2011 Ok, thanks all. Yes, I wondering if I should mess around with ZP pointers...but no I don't need to squeeze every last cycle out, just wanted something that was more efficient than my lame 1st attempt. Quote Link to comment Share on other sites More sharing options...
Irgendwer Posted January 24, 2011 Share Posted January 24, 2011 A general hint: If you are looking for general purpose routines (mem-copy, parameter passing, dynamic memory allocation ), also the sources of cc65 are a good place. ftp://ftp.musoftware.de/pub/uz/cc65/snapshot/cc65-snapshot-sources-2.13.9.20110121.tar.bz2 Memcpy() f.e. is in "/cc65-snapshot-2.13.9.20110121/libsrc/common/memcpy.s" If you know how to improve the routines (size/speed balance in mind, no self-modifying code), feed-back is welcome. Quote Link to comment Share on other sites More sharing options...
Creature XL Posted January 25, 2011 Share Posted January 25, 2011 I wouldn't bother running code in z-page unless every last cycle was critical. Really the only saving there is in the INC of the source/dest pointers and it occurs just once each per page. For the "last page" stuff, you can easily enough just copy the pointers from the top part of the code. When I read the title of the thread I saw the "fast" and that means for me the fastest rout possible. But in no way I wanted to say my approach is the fastest or best, instead I want to learn from others with real crazy ideas. And I mentioned in my post that even if the original topic started doesn't needed the absolute-squeeze-every-cycle routine - someone else who actually uses the search function to find the fastest routine might find this thread as the topic stated "fast". In the end the 6502 is too "simple" to come up with more creative ideas I guess. Quote Link to comment Share on other sites More sharing options...
Irgendwer Posted January 26, 2011 Share Posted January 26, 2011 And I mentioned in my post that even if the original topic started doesn't needed the absolute-squeeze-every-cycle routine - someone else who actually uses the search function to find the fastest routine might find this thread as the topic stated "fast". In the end the 6502 is too "simple" to come up with more creative ideas I guess. It depends to much on the use case. (Your zero page code is much faster when unrolled in non zero page memory.) Should it be a mem-copy for the common use case, or are pages only sufficient? Do the memory regions overlap or are they independent? Are the memory regions fixed or variable? Should the code 'romable' or is self modifying code allowed? Do you have a size restriction for the code? The memcpy() from cc65 I mentioned is a quite fast if you take * the size restriction * coverage of the common use case * no use of SMC into account. There is even a memmove() which handles overlapping areas correct. But if you like to have an inspiration for another kind of enhancement, have a look into memset(), where the area/length is divided by two and two separate pointer are working on the memory section (works also for non overlapping mem-copy): 'Normal': Loop: lda (src),y (5+) sta (dest),y (6) iny (2) bne Loop (3) Gives ca. 16 cycles per byte, or 14 if you work with SMC (absolute indexed addressing). 'Double Pointer': Loop: lda (src1),y (5+) sta (dest1),y (6) lda (src2),y (5+) sta (dest2),y (6) iny (2) bne Loop (3) ca. 13.5 cycles per byte, or 11.5 cycles absolute indexed addressing unrolled: PageLoop: lda (src1),y (5+) sta (dest1),y (6) lda (src1),y (5+) sta (dest1),y (6) lda (src2),y (5+) sta (dest2),y (6) lda (src2),y (5+) sta (dest2),y (6) iny (2) bne PageLoop (3) ca 12.25 cycles per byte (absolute indexed addressing may not useful here due to long set up time for SMC) Of course you also have do take the ptr set up time into account, which may results in a longer processing time when working on small blocks. (In the memset() implementation, you can also identify how I handled misalignment for the two pointers.) Quote Link to comment Share on other sites More sharing options...
Rybags Posted January 26, 2011 Share Posted January 26, 2011 There's all kind of optimisations. That divide/double move case could also be improved by simply expanding the loop as well. Save 3 cycles of the branch for each doubling-up on the load/store operations. In your typical gaming situation, I don't think you'd have many cases of overlapped source/dest, other than if doing scrolling by memory moves. With PMGs it's usually best to just copy from source to the PMG area. I did one routine, not sure if I've still got it, that has an optimisation where it also performs the erase operation before or after the object as needed, based upon it's previous position. Quote Link to comment Share on other sites More sharing options...
Irgendwer Posted January 26, 2011 Share Posted January 26, 2011 (edited) My last block was obviously wrong. It should be: unrolled: PageLoop: lda (src1),y (5+) sta (dest1),y (6) lda (src2),y (5+) sta (dest2),y (6) iny (2) lda (src1),y (5+) sta (dest1),y (6) lda (src2),y (5+) sta (dest2),y (6) iny (2) bne PageLoop (3) ca 12.75 cycles per byte (absolute indexed addressing may not useful here due to long set up time for SMC) That divide/double move case could also be improved by simply expanding the loop as well. Save 3 cycles of the branch for each doubling-up on the load/store operations. Yes, this was what I like to illustrate with the last block. Of course you can also increase the number of 'pointers' to save even the offset modification cycles(2). Here a little example for a SMC page clearing routine with four 'pointers': lda #0 ldx #$3F loop: sta $zzC0, x sta $zz80, x sta $zz40, x sta $zz00, x dex bpl loop where the page number is written to the 'zz'-parts of the addresses. Edited January 26, 2011 by Irgendwer Quote Link to comment Share on other sites More sharing options...
Rybags Posted January 27, 2011 Share Posted January 27, 2011 Additionally, there would be some point in the process that rather than having a bunch of INY, it'd be quicker to just TYA/CLC/ADC #nn/TAY at the end of loop. That lot = 8 cycles, same as 4 * INY, so that's the break-even point. Quote Link to comment Share on other sites More sharing options...
Calphool9999 Posted October 28, 2012 Share Posted October 28, 2012 (edited) I know this thread is old, but since I just wasted several hours on this, I figured I'd give back to the group. This is a known working block memory copy (compiles with MADS). I don't know how optimized it is. It's based on an example I found from D.W. Howerton. opt h- org $600 source = $cb; (cb - cc) dest = $cd; (cd - ce) length = $d0; (d0) jmp memcpy ; jump entry in case we change something .proc memcpy ; x=usr($600, source, dest, length) cld ; make sure were not in BCD mode pla ; pop number of arguments from USR() into accumulator cmp #3 ; compare accumulator to 3 stop bne stop ; lock up if we dont have 3 parms (safety mechanism... warm boot tolerant) pla ; pop hi byte of source into A sta source+1 ; store hi byte of source into ($cc) pla ; pop lo byte of source into A sta source ; store hi byte of source into ($cb) pla ; pop hi byte of destination into A sta dest+1 ; store hi byte of destination into ($ce) pla ; pop lo byte of destination into A sta dest ; store lo byte of desination into ($cd) pla ; pop hi byte of length into A sta length ; store hi byte of length in ($d0) pla ; pop low byte of length into A tax ; move A (low byte of length) into X bne start ; jump to start if X > 0 dec length ; subtract 1 from length start ldy #0 ; set Y to 0 move lda (source),y ; set A to whatever (source) points to offset by Y sta (dest),y ; move A to location pointed to by (dest) offset by Y iny ; increment Y bne next ; if Y<>0 then (rolled over) then still moving bytes inc source+1 ; increment hi byte of source inc dest+1 ; increment hi byte of dest next dex ; decrement X (lo byte counter) bne move ; if X<>0 then move another byte dec length ; weve moved 255 bytes, dec length bpl move ; if length is still positive go back and move more rts ; done .endp When you're relearning 6502 as I am, it's nice to have known working code (without macros or other fancy stuff) to start from. Edited October 28, 2012 by Calphool9999 Quote Link to comment Share on other sites More sharing options...
Irgendwer Posted October 28, 2012 Share Posted October 28, 2012 This is a known working block memory copy (compiles with MADS). I don't know how optimized it is. It's a straight forward, basic approach and not optimized in respect of the discussion above. When you're relearning 6502 as I am, it's nice to have known working code (without macros or other fancy stuff) to start from. I can only suggest inspecting the cc65 library sources for good, working and even somewhat documented code. Keep on! Quote Link to comment Share on other sites More sharing options...
Shawn Jefferson Posted October 28, 2012 Share Posted October 28, 2012 Haha, as long as you don't stumble on the code I've contributed. Quote Link to comment Share on other sites More sharing options...
Franco Catrin Posted April 24, 2015 Share Posted April 24, 2015 (edited) Here is my take ; Generic memcpy optimized for speed. No overlapping ; fcatrin@gmail.com ; ; Copy operation is divided in two parts. N is the total length ; * memcpyLong for N / 256 blocks ; * memcpyShort for N % 256 remaining bytes memcpySrc = 214 memcpyDst = 216 memcpyLen = 218 memcpy ldy #0 ldx memcpyLen+1 beq memcpyShort ; We need only the short version for <1 pages memcpyLoopLong ; Copy X pages lda (memcpySrc),y ; Loop unrolling can be done with confidence here sta (memcpyDst),y ; any power of 2 will work iny bne memcpyLoopLong dex beq memcpyShort inc memcpySrc+1 ; Go to the next page inc memcpyDst+1 jmp memcpyLoopLong memcpyShort ; Copy remaining bytes ldx memcpyLen beq memcpyEnd memcpyLoopShort ; Copy X bytes lda (memcpySrc),y sta (memcpyDst),y iny dex bne memcpyLoopShort memcpyEnd rts Edited April 24, 2015 by Franco Catrin Quote Link to comment Share on other sites More sharing options...
Heaven/TQA Posted April 25, 2015 Share Posted April 25, 2015 Memcopy using indirect addressing for speed? What about using self modified code instead? 2 Quote Link to comment Share on other sites More sharing options...
kenjennings Posted April 25, 2015 Share Posted April 25, 2015 Speed is relative. Compared to BASIC the crappiest assembler code that works is pure gold. 2 Quote Link to comment Share on other sites More sharing options...
emkay Posted April 25, 2015 Share Posted April 25, 2015 Speed is relative. Compared to BASIC the crappiest assembler code that works is pure gold. Ever tried the method of DIMming a String Variable with the Variable pointer set to graphics? The "self modifying code " is surely faster ... Quote Link to comment Share on other sites More sharing options...
flashjazzcat Posted April 25, 2015 Share Posted April 25, 2015 Ever tried the method of DIMming a String Variable with the Variable pointer set to graphics? The "self modifying code " is surely faster ... No-one disputes it, but the speed advantage of self-modifying code in assembly pales into insignificance compared to the speed advantage of assembler over BASIC. I think that's the point being made. Quote Link to comment Share on other sites More sharing options...
Franco Catrin Posted April 25, 2015 Share Posted April 25, 2015 Memcopy using indirect addressing for speed? What about using self modified code instead? It depends on how much setup code would you need considering the option to use unrolled loops. If you don't want unrolled loops, it's a good improvement on the code above 1 Quote Link to comment Share on other sites More sharing options...
emkay Posted April 25, 2015 Share Posted April 25, 2015 It depends on how much setup code would you need considering the option to use unrolled loops. If you don't want unrolled loops, it's a good improvement on the code above You have to keep in mind that CPU is limited. RAM is "more flexible" 1 Quote Link to comment Share on other sites More sharing options...
ricortes Posted April 25, 2015 Share Posted April 25, 2015 Pertinent to the object of the thread "fast general purpose memory copy" A few caveats of course. Much longer then the standard GP memory move. In all cases a hard coded memory move will be faster. Since it is longer then the the standard ~41 byte GP memory move, you would need to have at least two instances of calling the memory move before you reach break even on memory used compared to hard coded routines. REALLY TOO LONG AGO for me to remember much about the code. My nature is ~as soon as I am faster then everyone else, I stop optimizing. AFAIK I had a HD crash so the only thing I had left was the assembler output. In this case, since I was making an Action code block, the assembler output with hex listing was actually a good thing. If someone needs it as a BASIC or cc65 routine, let me know if you need help modifying it. 0001 0000 ; 0002 0000 ;Quick & dirty quick move left. Make sure the Dest and Source do 0003 0000 ;not overlap. It may still work depending on the number of bytes 0004 0000 ;moved or if the Source is higher then Dest, but don't take it 0005 0000 ;for granted. 0006 0000 ; 0007 0000 ;Written by Rick Cortese 6/18/96 using the Speech Technologies TASM 0008 0000 ;assembler for MS DOS. Basic standard core routine from 0009 0000 ;Assembly Language Subroutines by Leventhal, Levine. 0010 0000 ; 0011 0000 ;Calling convention from Action, Move(Source, Destination, Size) 0012 0000 ;This should place the Source pointer at $A0, Destination at $A2, & 0013 0000 ;Size at $A4. *NOTE* The Action manual is just plain wrong on this. 0014 0000 ;For passing more then 24 bits of parameters, the compiler calls a 0015 0000 ;routine to copy them to page zero. Near as I can tell, it uses 0016 0000 ;the X register as an index & *FAILS* to restore it! A & Y are intact, 0017 0000 ;& mirrored at $A0 & $A2, but you *MUST* LDX $A1 if you want it to 0018 0000 ;have the proper value as described in the manual. 0019 0000 ; 0020 0000 ;get some stuff defined 0021 0000 Source =$A0 0022 0000 Dest =$A2 0023 0000 Size =$A4 0024 0000 ; 0025 0000 ;It will be relocatable, but have to start it somewhere for the 0026 0000 ;assembler. $680 is good since people with MAC/65 can test it. 0027 0000 ; 0028 0680 *=$680 0029 0680 start 0030 0680 ; 0031 0680 ;If its less then 256 bytes, go straight to clean up routine 0032 0680 ; 0033 0680 A0 00 ldy #0 0034 0682 A6 A5 ldx Size+1 0035 0684 F0 13 beq part 0036 0686 ; 0037 0686 ;Something that should have been obvious to everyone looking for 0038 0686 ;speed, but I've never seen it used in any general purpose routine: 0039 0686 ;A PAGE IS 256 BYTES! Once you decide this, you could have 256 0040 0686 ; lda (Source),y 0041 0686 ; sta (Dest),y 0042 0686 ; iny 0043 0686 ;sequences with no branching! Since 256 is an even number & to keep the 0044 0686 ;code a bit smaller but not much slower, an even number of sequences with 0045 0686 ;a single branch at the end will invariably hit a zero at the last 0046 0686 ;iny in the sequence! 0047 0686 ; 0048 0686 page_move 0049 0686 B1 A0 lda (Source),y 0050 0688 91 A2 sta (Dest),y 0051 068A C8 iny 0052 068B B1 A0 lda (Source),y 0053 068D 91 A2 sta (Dest),y 0054 068F C8 iny 0055 0690 D0 F4 bne page_move 0056 0692 E6 A1 inc Source+1 0057 0694 E6 A3 inc Dest+1 0058 0696 CA dex 0059 0697 D0 ED bne page_move 0060 0699 part 0061 0699 A6 A4 ldx Size 0062 069B F0 14 beq exit 0063 069D ; 0064 069D ;ok, here's where another leap in logic takes place. You just found out 0065 069D ;the remainder is not equal to zero to get here. The only other values 0066 069D ;it could be are 1-255. Now how do you get a multiple sequence like 0067 069D ;above to work? Easy, just test bit zero! If bit zero =0 then you have 0068 069D ;at least 2 bytes to move so the mutiple sequence move above works. If 0069 069D ;bit zero does not equal 1, then you either have a single byte to move 0070 069D ;or an odd number of bytes. By jumping to the middle of the routine, 0071 069D ;you now have either moved the single byte or converted the odd number 0072 069D ;of bytes remaining to an even number suitable for the routine. 0073 069D ; 0074 069D A5 A4 lda Size 0075 069F 29 01 and #1 0076 06A1 D0 06 bne magic 0077 06A3 last 0078 06A3 B1 A0 lda (Source),y 0079 06A5 91 A2 sta (Dest),y 0080 06A7 C8 iny 0081 06A8 CA dex 0082 06A9 magic 0083 06A9 B1 A0 lda (Source),y 0084 06AB 91 A2 sta (Dest),y 0085 06AD C8 iny 0086 06AE CA dex 0087 06AF D0 F2 bne last 0088 06B1 exit 0089 06B1 ; 0090 06B1 ;Stick a: 0091 06B1 ; rts 0092 06B1 ;here if you want to call this as a subroutine. The RETURN in an 0093 06B1 ;Action program generates a rts & is needed after the code block 0094 06B1 ;below to keep the compiler from getting confused. 0095 06B1 ; 0096 06B1 .end 0097 06B1 ; 0098 06B1 ;In my tests, I found this to be about 10-12% faster then the Action 0099 06B1 ;built in routine. I've never seen this particular method done in 0100 06B1 ;a general purpose routine before, but then I admit I haven't 0101 06B1 ;seen everything ever written for the 6502. I did read every Antic, 0102 06B1 ;Compute, & about 1/2 the ANALOG magazines between 1984 & the time 0103 06B1 ;when they disappeared & never saw a routine like it. 0104 06B1 ;You could shave a few more tics off a move by using the 0105 06B1 ; lda address,x 0106 06B1 ; ... 0107 06B1 ;method, but then the move would have to be hard coded or self 0108 06B1 ;modifying. In the later case, the modification process may take 0109 06B1 ;up more time then is saved by the routine. 0110 06B1 ;It seems you can teach an old dog new tricks. 0111 06B1 ;The code block to include in your Action program is: 0112 06B1 ; 0113 06B1 ;PROC Move(CARD Source, Dest, Size) 0114 06B1 ; [$A0$00$A6$A5$F0$13$B1$A0$91$A2$C8$B1$A0$91$A2$C8 0115 06B1 ; $D0$F4$E6$A1$E6$A3$CA$D0$ED$A6$A4$F0$14$A5$A4$29 0116 06B1 ; $01$D0$06$B1$A0$91$A2$C8$CA$B1$A0$91$A2$C8$CA$D0 0117 06B1 ; $F2] RETURN 0118 06B1 ; tasm: Number of errors = 0 1 Quote Link to comment Share on other sites More sharing options...
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.