IGNORED

# Fastest Possible bulk copy?

## Recommended Posts

The context I'm thinking of is a bulk copy - 4K or more, e.g. a whole screen being paged in during flyback

that sort of thing.

So - simplifying assumptions; source and dest do NOT overlap, source and dest are page aligned, size of copy is a multiple of whole pages.

If we use the "natural" coding (two zero page pointers, indexed by Y, and unroll the loop enough, we can get asymptotically close to

12 cycles per byte;

lda (SRC),y       ; 5
sta (DST),y       ; 5
iny               ; 2

This is simple to understand, and (in fact) doesn't really need all my simplifying assumptions.

To go faster, I think we need self-modifying code, using abs,x addressing (4 cycles).

Consider the following fragment:

src1:  lda \$aa00,x   ; 4
dst1:  sta \$bb00,x   ; 4
src2:  lda \$aa01,x   ; 4
dst2:  sta \$bb01,x   ; 4
src3:  lda \$aa02,x   ; 4
dst3:  sta \$bb02,x   ; 4
.
.

By using incrementing low bytes on the absolute addresses, we avoid the need to increment x. And we've saved a cycle on the load and store. This is 8 cycles per byte.

So, we unroll the loop, and add "some number" to x each time round the loop. If we make the unrolling a power of 2, this will neatly come out to a page. However, we also need to perform the self modifying for each page, setting the (incrementing) page location of SRC and DST into the high byte of each absolute address; we need to set SRC at src1+2, src2+2, etc and DST at dst1+2, dst2+2, etc. Each of these STA \$xxxx takes 4 cycles.

If we unroll by N, where N is one of 4,8,16 etc, the copying loop will take N*8 cycles (plus a handful more for the loop back, 8 or 10).
For the self-mod, we will also have N*8, although the self mod code is run once per page; the loop is run (256/N) per page.

So the total cycles per page, in an unroll of N is around:

N*8 + (256/N)*(N*8* + 10)

So if we make N too big, we'll do too much self-mod, and if we make N too small, the copy loop won't be fast. What's best?

Setting up this equation in a dirty fragment of perl, I get:

4:2720
rate 10.625000
8:2432
rate 9.500000
16:2336
rate 9.125000
32:2384
rate 9.312500
64:2600
rate 10.156250
128:3092
rate 12.078125
(rate is cycles per byte).

So 16 appears best.

The result (at the hand-waving level) is a block copy running at 9.1 cycles per byte. I've probably left out some overhead in this, but I think the overall analysis is OK.

BugBear

Edited by bugbear
##### Share on other sites

ldx #0          ; 2

lda \$2000,x     ; 4
sta \$3000,x     ; 4
lda \$2100,x
sta \$3100,x
lda \$2200,x
sta \$3200,x
lda \$2300,x
sta \$3300,x
lda \$2400,x
sta \$3400,x
lda \$2500,x
sta \$3500,x
lda \$2600,x
sta \$3600,x
lda \$2700,x
sta \$3700,x
lda \$2800,x
sta \$3800,x
lda \$2900,x
sta \$3900,x
lda \$2a00,x
sta \$3a00,x
lda \$2b00,x
sta \$3b00,x
lda \$2c00,x
sta \$3c00,x
lda \$2d00,x
sta \$3d00,x
lda \$2e00,x
sta \$3e00,x
lda \$2f00,x
sta \$3f00,x

inx             ; 2
bne copy        ; 3

; (2 + (16*8 + 2 + 3) * 256)/4096 = 8.312988 cycles per byte

This is how I would bulk copy 4kB (16 pages).

Edited by ivop
##### Share on other sites

Absolute fastest is just

LDA SRC

STA DEST

LDA SRC + 1

STA DEST + 1

8 cycles per byte but an obscene memory requirement of <screen size> * 6 so about 48K for a hires or 24K for a Graphics 7 bitmap.

But not so bad for a text screen, under 6K.

A less costly method in terms of memory and not too much worse for cycles is using indexed addressing for every page of screen RAM.

To ensure maximum efficiency, set your screen base to minimize crossing page boundaries in the move.

LDX #0

COPY LDA SRC,X

STA DEST,X

LDA SRC+\$100,X

STA DEST+\$100,X

etc. until

LDA SRC+\$1E00,X

STA DEST+\$1E00,X

INX

BNE COPY

Alternative to this you could use self-modifying code to make the internals of the loop smaller and just modify the operands and run the loop again.

Or better alternative - given the flexibility of the Atari screen architecture you can use DLists with LMS and fine scrolling to mimimize or eliminate having to move memory around.

Edited by Rybags
##### Share on other sites

sta \$3000,x     ; 4

Unfortunately not (5).

##### Share on other sites

if I have the mem... I would do brute force

.rept 4096

LDA src+#

STA dst+#

.endr

##### Share on other sites

>> This is how I would bulk copy 4kB (16 pages).

Unrolling the outer loop instead of the inner - I love it!

BugBear

##### Share on other sites

In the modern day with large flashcarts and Ram sizes, entirely feasible to do enormous brute-force lumps of code rather than time consuming loops.

Also worth mentioning the rendering technique exposed by Heaven in the Activision Zone Ranger game. Has a thread from several years ago somewhere around here.

##### Share on other sites

Unfortunately not (5).

Aw, yes. Good catch.

(2 + (16*9 + 2 + 3) * 256)/4096 = 9.312988 cycles per byte

I just copied 4 and 4 from bugbear's post, but a abs,x store is indeed one cycle more.

##### Share on other sites

In the modern day with large flashcarts and Ram sizes, entirely feasible to do enormous brute-force lumps of code rather than time consuming loops.

In the modern age, a GPU can shift giga-bytes per second. But (for me) being full-on-retro is more fun.

BugBear

## Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

×   Pasted as rich text.   Paste as plain text instead

Only 75 emoji are allowed.