fast memory copy

danwinslow · January 22, 2011

Looking for asm examples of a fast general purpose memory copy.

Thanks

Rybags · January 22, 2011

Depends what you're doing.

If it's a fixed move where the locations don't change then unrolled loops with absolute indexed will be fast.

For a random move, a self-modified loop with absolute addressing can be fast.

Other criteria include - is it always a multiple of 256 bytes being moved, or can it be any amount.

danwinslow · January 22, 2011

General purpose, so no guaranteed boundaries, no set amount, and can't be unrolled. Not sure how self modify would be used, or why it would be faster than say just a zp pointer set. Could you show a small example?

Nukey Shay · January 22, 2011

2 .word pointers...SOURCE and TARGET. X and Y hold the MSB and LSB of the # of bytes to move.

If the routine is in RAM memory, you can just use absolute,Y indexing and adjust the source and target addresses when the loop rolls to zero (adjusting the MSB of both, bumping X, and reenter the loop if no rollover occurs).

If the routine is in ROM, set up the pointers in zeropage space and use indirect(,Y) addressing.

Keep in mind that since the routine is generic, you'd need 2 versions. Once that moves memory from the top down, and one that moves from the bottom up (to avoid any possible overlapping). A 2-byte comparison between SOURCE and TARGET at the beginning can be used to decide which one to branch to.

Rybags · January 22, 2011

lda abs,x = 4 cycles or 5 if a page boundary crossed

lda (zp),y = 5 cycles or 6 if page boundary crossed

So there's savings to be had there.

Generally you'd want your parameters to be SOURCE, DEST, and LENGTH. In this case, LENGTH is # of bytes -1.

So, using indirect in a loop...

ldy #0
ldx length+1 ; use X as page counter
beq last ; if moving less than 256 bytes, skip main loop
moveloop
lda (source),y
sta (dest),y
; hint - you can use multiple pairs of lda/sta here to save time, use one INY for each, number of lda/sta pairs must be a factor of 256
iny
bne moveloop
inc source
inc dest
dex
bne moveloop
last
; this bit moves the last part
lda (source),y
sta (dest),x
cpy length
beq finished
iny
bne last

That should work - was off the top of my head. Obviously there can be some more optimisation without using too much more program space.

The easiest speed boost is to just unroll that top loop some. 5 bytes of extra program reduces the number of branch occurrences, so a pair of LDA/STA would save about 256 cycles for every 256 bytes you're moving.

Creature XL · January 24, 2011

lda abs,x = 4 cycles or 5 if a page boundary crossed

lda (zp),y = 5 cycles or 6 if page boundary crossed

So there's savings to be had there.

Generally you'd want your parameters to be SOURCE, DEST, and LENGTH. In this case, LENGTH is # of bytes -1.

So, using indirect in a loop...
ldy #0
ldx length+1 ; use X as page counter
beq last ; if moving less than 256 bytes, skip main loop
moveloop
lda (source),y
sta (dest),y
; hint - you can use multiple pairs of lda/sta here to save time, use one INY for each, number of lda/sta pairs must be a factor of 256
iny
bne moveloop
inc source
inc dest
dex
bne moveloop
last
; this bit moves the last part
lda (source),y
sta (dest),x
cpy length
beq finished
iny
bne last
That should work - was off the top of my head. Obviously there can be some more optimisation without using too much more program space.

The easiest speed boost is to just unroll that top loop some. 5 bytes of extra program reduces the number of branch occurrences, so a pair of LDA/STA would save about 256 cycles for every 256 bytes you're moving.

At least one typo: (),x

I am not sure why the loop checking of the "last" loop is so "complex".

Additionally I want to offer another version, which includes SMC. Might be that the OP can't use SMC but others might be reading/finding this thread so I think I offer it. It is the same as Rybags idea, so all the mentioned optimizations he stated apply. The difference is that I would pack this routine in the Zeropage and use SMC:

"movei_w" is a macro with makes code more readable. It consist of 2 pairs of LDA/STA as you might have assumed.

; only owrks in Zeropage!
fastmove:
	ldy	#0
	movei_w	src,smcs+1
	movei_w	dst,smcd+1
	ldx	len+1
	beq	last
loop:
smcs:		lda	$f000,y		; gets overridden
smcd:		sta	$f000,y		; ditto
	iny
	bne	loop
	inc	smcs+2
	inc	smcd+2
	dex
	bne	loop
last:
	lda	(smcs+1),y
	sta	(smcd+1),y
	iny
	cpy	#len
	bne	last

I hope this is error free. Its not tested!

Rybags · January 24, 2011

I wouldn't bother running code in z-page unless every last cycle was critical.

Really the only saving there is in the INC of the source/dest pointers and it occurs just once each per page.

For the "last page" stuff, you can easily enough just copy the pointers from the top part of the code.

danwinslow · January 24, 2011

Ok, thanks all. Yes, I wondering if I should mess around with ZP pointers...but no I don't need to squeeze every last cycle out, just wanted something that was more efficient than my lame 1st attempt.

Irgendwer · January 24, 2011

A general hint:

If you are looking for general purpose routines (mem-copy, parameter passing, dynamic memory allocation ), also the sources of cc65 are a good place.

ftp://ftp.musoftware.de/pub/uz/cc65/snapshot/cc65-snapshot-sources-2.13.9.20110121.tar.bz2

Memcpy() f.e. is in "/cc65-snapshot-2.13.9.20110121/libsrc/common/memcpy.s"

If you know how to improve the routines (size/speed balance in mind, no self-modifying code), feed-back is welcome.

Creature XL · January 25, 2011

I wouldn't bother running code in z-page unless every last cycle was critical.

Really the only saving there is in the INC of the source/dest pointers and it occurs just once each per page.

For the "last page" stuff, you can easily enough just copy the pointers from the top part of the code.

When I read the title of the thread I saw the "fast" and that means for me the fastest rout possible. But in no way I wanted to say my approach is the fastest or best, instead I want to learn from others with real crazy ideas.

And I mentioned in my post that even if the original topic started doesn't needed the absolute-squeeze-every-cycle routine - someone else who actually uses the search function to find the fastest routine might find this thread as the topic stated "fast". In the end the 6502 is too "simple" to come up with more creative ideas I guess.

Irgendwer · January 26, 2011

And I mentioned in my post that even if the original topic started doesn't needed the absolute-squeeze-every-cycle routine - someone else who actually uses the search function to find the fastest routine might find this thread as the topic stated "fast". In the end the 6502 is too "simple" to come up with more creative ideas I guess.

It depends to much on the use case. (Your zero page code is much faster when unrolled in non zero page memory.)

Should it be a mem-copy for the common use case, or are pages only sufficient?

Do the memory regions overlap or are they independent?

Are the memory regions fixed or variable?

Should the code 'romable' or is self modifying code allowed?

Do you have a size restriction for the code?

The memcpy() from cc65 I mentioned is a quite fast if you take

* the size restriction

* coverage of the common use case

* no use of SMC

into account.

There is even a memmove() which handles overlapping areas correct.

But if you like to have an inspiration for another kind of enhancement, have a look into memset(), where the area/length is divided by two and two separate pointer are working on the memory section (works also for non overlapping mem-copy):

'Normal':

Loop:

lda (src),y (5+)

sta (dest),y (6)

iny (2)

bne Loop (3)

Gives ca. 16 cycles per byte, or 14 if you work with SMC (absolute indexed addressing).

'Double Pointer':

Loop:

lda (src1),y (5+)

sta (dest1),y (6)

lda (src2),y (5+)

sta (dest2),y (6)

iny (2)

bne Loop (3)

ca. 13.5 cycles per byte, or 11.5 cycles absolute indexed addressing

unrolled:

PageLoop:

lda (src1),y (5+)

sta (dest1),y (6)

lda (src1),y (5+)

sta (dest1),y (6)

lda (src2),y (5+)

sta (dest2),y (6)

lda (src2),y (5+)

sta (dest2),y (6)

iny (2)

bne PageLoop (3)

ca 12.25 cycles per byte (absolute indexed addressing may not useful here due to long set up time for SMC)

Of course you also have do take the ptr set up time into account, which may results in a longer processing time when

working on small blocks. (In the memset() implementation, you can also identify how I handled misalignment for the two

pointers.)

Rybags · January 26, 2011

There's all kind of optimisations.

That divide/double move case could also be improved by simply expanding the loop as well. Save 3 cycles of the branch for each doubling-up on the load/store operations.

In your typical gaming situation, I don't think you'd have many cases of overlapped source/dest, other than if doing scrolling by memory moves.

With PMGs it's usually best to just copy from source to the PMG area. I did one routine, not sure if I've still got it, that has an optimisation where it also performs the erase operation before or after the object as needed, based upon it's previous position.

Irgendwer · January 26, 2011

My last block was obviously wrong. It should be:

unrolled:

PageLoop:

lda (src1),y (5+)

sta (dest1),y (6)

lda (src2),y (5+)

sta (dest2),y (6)

iny (2)

lda (src1),y (5+)

sta (dest1),y (6)

lda (src2),y (5+)

sta (dest2),y (6)

iny (2)

bne PageLoop (3)

ca 12.75 cycles per byte (absolute indexed addressing may not useful here due to long set up time for SMC)

That divide/double move case could also be improved by simply expanding the loop as well. Save 3 cycles of the branch for each doubling-up on the load/store operations.

Yes, this was what I like to illustrate with the last block. Of course you can also increase the number of 'pointers' to save even the offset modification cycles(2). Here a little example for a SMC page clearing routine with four 'pointers':

lda #0

ldx #$3F

loop:

sta $zzC0, x

sta $zz80, x

sta $zz40, x

sta $zz00, x

dex

bpl loop

where the page number is written to the 'zz'-parts of the addresses.

Edited January 26, 2011 by Irgendwer

Rybags · January 27, 2011

Additionally, there would be some point in the process that rather than having a bunch of INY, it'd be quicker to just TYA/CLC/ADC #nn/TAY at the end of loop.

That lot = 8 cycles, same as 4 * INY, so that's the break-even point.

Calphool9999 · October 28, 2012

I know this thread is old, but since I just wasted several hours on this, I figured I'd give back to the group. This is a known working block memory copy (compiles with MADS). I don't know how optimized it is. It's based on an example I found from D.W. Howerton.

opt h-
org $600

source = $cb; (cb - cc)
dest = $cd; (cd - ce)
length = $d0; (d0)

 jmp memcpy ; jump entry in case we change something

.proc memcpy ; x=usr($600, source, dest, length)
cld ; make sure were not in BCD mode
pla ; pop number of arguments from USR() into accumulator
cmp #3 ; compare accumulator to 3
stop bne stop ; lock up if we dont have 3 parms (safety mechanism... warm boot tolerant)
pla ; pop hi byte of source into A
sta source+1 ; store hi byte of source into ($cc)
pla ; pop lo byte of source into A
sta source ; store hi byte of source into ($cb)
pla ; pop hi byte of destination into A
sta dest+1 ; store hi byte of destination into ($ce)
pla ; pop lo byte of destination into A
sta dest ; store lo byte of desination into ($cd)
pla ; pop hi byte of length into A
sta length ; store hi byte of length in ($d0)
pla ; pop low byte of length into A
tax ; move A (low byte of length) into X
bne start ; jump to start if X > 0
dec length ; subtract 1 from length
start ldy #0 ; set Y to 0
move lda (source),y ; set A to whatever (source) points to offset by Y
sta (dest),y ; move A to location pointed to by (dest) offset by Y
iny ; increment Y
bne next ; if Y<>0 then (rolled over) then still moving bytes
inc source+1 ; increment hi byte of source
inc dest+1 ; increment hi byte of dest
next dex ; decrement X (lo byte counter)
bne move ; if X<>0 then move another byte
dec length ; weve moved 255 bytes, dec length
bpl move ; if length is still positive go back and move more
rts ; done
.endp

When you're relearning 6502 as I am, it's nice to have known working code (without macros or other fancy stuff) to start from.

Edited October 28, 2012 by Calphool9999

Irgendwer · October 28, 2012

This is a known working block memory copy (compiles with MADS). I don't know how optimized it is.

It's a straight forward, basic approach and not optimized in respect of the discussion above.

When you're relearning 6502 as I am, it's nice to have known working code (without macros or other fancy stuff) to start from.

I can only suggest inspecting the cc65 library sources for good, working and even somewhat documented code.

Keep on!

Shawn Jefferson · October 28, 2012

Haha, as long as you don't stumble on the code I've contributed.

Franco Catrin · April 24, 2015

Here is my take

; Generic memcpy optimized for speed. No overlapping

; fcatrin@gmail.com

;

; Copy operation is divided in two parts. N is the total length

; * memcpyLong for N / 256 blocks

; * memcpyShort for N % 256 remaining bytes

memcpySrc = 214

memcpyDst = 216

memcpyLen = 218

memcpy

ldy #0

ldx memcpyLen+1

beq memcpyShort ; We need only the short version for <1 pages

memcpyLoopLong ; Copy X pages

lda (memcpySrc),y ; Loop unrolling can be done with confidence here

sta (memcpyDst),y ; any power of 2 will work

iny

bne memcpyLoopLong

dex

beq memcpyShort

inc memcpySrc+1 ; Go to the next page

inc memcpyDst+1

jmp memcpyLoopLong

memcpyShort ; Copy remaining bytes

ldx memcpyLen

beq memcpyEnd

memcpyLoopShort ; Copy X bytes

lda (memcpySrc),y

sta (memcpyDst),y

iny

dex

bne memcpyLoopShort

memcpyEnd

rts

Edited April 24, 2015 by Franco Catrin

Heaven/TQA · April 25, 2015

Memcopy using indirect addressing for speed? What about using self modified code instead?

kenjennings · April 25, 2015

Speed is relative. Compared to BASIC the crappiest assembler code that works is pure gold.

emkay · April 25, 2015

Speed is relative. Compared to BASIC the crappiest assembler code that works is pure gold.

Ever tried the method of DIMming a String Variable with the Variable pointer set to graphics?

The "self modifying code " is surely faster ...

flashjazzcat · April 25, 2015

Ever tried the method of DIMming a String Variable with the Variable pointer set to graphics?

The "self modifying code " is surely faster ...

No-one disputes it, but the speed advantage of self-modifying code in assembly pales into insignificance compared to the speed advantage of assembler over BASIC. I think that's the point being made.

Franco Catrin · April 25, 2015

Memcopy using indirect addressing for speed? What about using self modified code instead?

It depends on how much setup code would you need considering the option to use unrolled loops. If you don't want unrolled loops, it's a good improvement on the code above

emkay · April 25, 2015

It depends on how much setup code would you need considering the option to use unrolled loops. If you don't want unrolled loops, it's a good improvement on the code above

You have to keep in mind that CPU is limited. RAM is "more flexible"

ricortes · April 25, 2015

Pertinent to the object of the thread "fast general purpose memory copy"

A few caveats of course. Much longer then the standard GP memory move. In all cases a hard coded memory move will be faster. Since it is longer then the the standard ~41 byte GP memory move, you would need to have at least two instances of calling the memory move before you reach break even on memory used compared to hard coded routines. REALLY TOO LONG AGO for me to remember much about the code. My nature is ~as soon as I am faster then everyone else, I stop optimizing. AFAIK I had a HD crash so the only thing I had left was the assembler output. In this case, since I was making an Action code block, the assembler output with hex listing was actually a good thing. If someone needs it as a BASIC or cc65 routine, let me know if you need help modifying it.

0001 0000 ;

0002 0000 ;Quick & dirty quick move left. Make sure the Dest and Source do

0003 0000 ;not overlap. It may still work depending on the number of bytes

0004 0000 ;moved or if the Source is higher then Dest, but don't take it

0005 0000 ;for granted.

0006 0000 ;

0007 0000 ;Written by Rick Cortese 6/18/96 using the Speech Technologies TASM

0008 0000 ;assembler for MS DOS. Basic standard core routine from

0009 0000 ;Assembly Language Subroutines by Leventhal, Levine.

0010 0000 ;

0011 0000 ;Calling convention from Action, Move(Source, Destination, Size)

0012 0000 ;This should place the Source pointer at $A0, Destination at $A2, &

0013 0000 ;Size at $A4. *NOTE* The Action manual is just plain wrong on this.

0014 0000 ;For passing more then 24 bits of parameters, the compiler calls a

0015 0000 ;routine to copy them to page zero. Near as I can tell, it uses

0016 0000 ;the X register as an index & *FAILS* to restore it! A & Y are intact,

0017 0000 ;& mirrored at $A0 & $A2, but you *MUST* LDX $A1 if you want it to

0018 0000 ;have the proper value as described in the manual.

0019 0000 ;

0020 0000 ;get some stuff defined

0021 0000 Source =$A0

0022 0000 Dest =$A2

0023 0000 Size =$A4

0024 0000 ;

0025 0000 ;It will be relocatable, but have to start it somewhere for the

0026 0000 ;assembler. $680 is good since people with MAC/65 can test it.

0027 0000 ;

0028 0680 *=$680

0029 0680 start

0030 0680 ;

0031 0680 ;If its less then 256 bytes, go straight to clean up routine

0032 0680 ;

0033 0680 A0 00 ldy #0

0034 0682 A6 A5 ldx Size+1

0035 0684 F0 13 beq part

0036 0686 ;

0037 0686 ;Something that should have been obvious to everyone looking for

0038 0686 ;speed, but I've never seen it used in any general purpose routine:

0039 0686 ;A PAGE IS 256 BYTES! Once you decide this, you could have 256

0040 0686 ; lda (Source),y

0041 0686 ; sta (Dest),y

0042 0686 ; iny

0043 0686 ;sequences with no branching! Since 256 is an even number & to keep the

0044 0686 ;code a bit smaller but not much slower, an even number of sequences with

0045 0686 ;a single branch at the end will invariably hit a zero at the last

0046 0686 ;iny in the sequence!

0047 0686 ;

0048 0686 page_move

0049 0686 B1 A0 lda (Source),y

0050 0688 91 A2 sta (Dest),y

0051 068A C8 iny

0052 068B B1 A0 lda (Source),y

0053 068D 91 A2 sta (Dest),y

0054 068F C8 iny

0055 0690 D0 F4 bne page_move

0056 0692 E6 A1 inc Source+1

0057 0694 E6 A3 inc Dest+1

0058 0696 CA dex

0059 0697 D0 ED bne page_move

0060 0699 part

0061 0699 A6 A4 ldx Size

0062 069B F0 14 beq exit

0063 069D ;

0064 069D ;ok, here's where another leap in logic takes place. You just found out

0065 069D ;the remainder is not equal to zero to get here. The only other values

0066 069D ;it could be are 1-255. Now how do you get a multiple sequence like

0067 069D ;above to work? Easy, just test bit zero! If bit zero =0 then you have

0068 069D ;at least 2 bytes to move so the mutiple sequence move above works. If

0069 069D ;bit zero does not equal 1, then you either have a single byte to move

0070 069D ;or an odd number of bytes. By jumping to the middle of the routine,

0071 069D ;you now have either moved the single byte or converted the odd number

0072 069D ;of bytes remaining to an even number suitable for the routine.

0073 069D ;

0074 069D A5 A4 lda Size

0075 069F 29 01 and #1

0076 06A1 D0 06 bne magic

0077 06A3 last

0078 06A3 B1 A0 lda (Source),y

0079 06A5 91 A2 sta (Dest),y

0080 06A7 C8 iny

0081 06A8 CA dex

0082 06A9 magic

0083 06A9 B1 A0 lda (Source),y

0084 06AB 91 A2 sta (Dest),y

0085 06AD C8 iny

0086 06AE CA dex

0087 06AF D0 F2 bne last

0088 06B1 exit

0089 06B1 ;

0090 06B1 ;Stick a:

0091 06B1 ; rts

0092 06B1 ;here if you want to call this as a subroutine. The RETURN in an

0093 06B1 ;Action program generates a rts & is needed after the code block

0094 06B1 ;below to keep the compiler from getting confused.

0095 06B1 ;

0096 06B1 .end

0097 06B1 ;

0098 06B1 ;In my tests, I found this to be about 10-12% faster then the Action

0099 06B1 ;built in routine. I've never seen this particular method done in

0100 06B1 ;a general purpose routine before, but then I admit I haven't

0101 06B1 ;seen everything ever written for the 6502. I did read every Antic,

0102 06B1 ;Compute, & about 1/2 the ANALOG magazines between 1984 & the time

0103 06B1 ;when they disappeared & never saw a routine like it.

0104 06B1 ;You could shave a few more tics off a move by using the

0105 06B1 ; lda address,x

0106 06B1 ; ...

0107 06B1 ;method, but then the move would have to be hard coded or self

0108 06B1 ;modifying. In the later case, the modification process may take

0109 06B1 ;up more time then is saved by the routine.

0110 06B1 ;It seems you can teach an old dog new tricks.

0111 06B1 ;The code block to include in your Action program is:

0112 06B1 ;

0113 06B1 ;PROC Move(CARD Source, Dest, Size)

0114 06B1 ; [$A0$00$A6$A5$F0$13$B1$A0$91$A2$C8$B1$A0$91$A2$C8

0115 06B1 ; $D0$F4$E6$A1$E6$A3$CA$D0$ED$A6$A4$F0$14$A5$A4$29

0116 06B1 ; $01$D0$06$B1$A0$91$A2$C8$CA$B1$A0$91$A2$C8$CA$D0

0117 06B1 ; $F2] RETURN

0118 06B1 ;

tasm: Number of errors = 0

fast memory copy

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation

Recently Browsing 0 members