bugbear Posted May 8, 2018 Share Posted May 8, 2018 (edited) The context I'm thinking of is a bulk copy - 4K or more, e.g. a whole screen being paged in during flyback that sort of thing. So - simplifying assumptions; source and dest do NOT overlap, source and dest are page aligned, size of copy is a multiple of whole pages. If we use the "natural" coding (two zero page pointers, indexed by Y, and unroll the loop enough, we can get asymptotically close to 12 cycles per byte; lda (SRC),y ; 5 sta (DST),y ; 5 iny ; 2 This is simple to understand, and (in fact) doesn't really need all my simplifying assumptions.To go faster, I think we need self-modifying code, using abs,x addressing (4 cycles).Consider the following fragment: src1: lda $aa00,x ; 4 dst1: sta $bb00,x ; 4 src2: lda $aa01,x ; 4 dst2: sta $bb01,x ; 4 src3: lda $aa02,x ; 4 dst3: sta $bb02,x ; 4 . . By using incrementing low bytes on the absolute addresses, we avoid the need to increment x. And we've saved a cycle on the load and store. This is 8 cycles per byte. So, we unroll the loop, and add "some number" to x each time round the loop. If we make the unrolling a power of 2, this will neatly come out to a page. However, we also need to perform the self modifying for each page, setting the (incrementing) page location of SRC and DST into the high byte of each absolute address; we need to set SRC at src1+2, src2+2, etc and DST at dst1+2, dst2+2, etc. Each of these STA $xxxx takes 4 cycles. If we unroll by N, where N is one of 4,8,16 etc, the copying loop will take N*8 cycles (plus a handful more for the loop back, 8 or 10).For the self-mod, we will also have N*8, although the self mod code is run once per page; the loop is run (256/N) per page.So the total cycles per page, in an unroll of N is around:N*8 + (256/N)*(N*8* + 10)So if we make N too big, we'll do too much self-mod, and if we make N too small, the copy loop won't be fast. What's best?Setting up this equation in a dirty fragment of perl, I get: 4:2720rate 10.6250008:2432rate 9.50000016:2336rate 9.12500032:2384rate 9.31250064:2600rate 10.156250128:3092rate 12.078125(rate is cycles per byte). So 16 appears best.The result (at the hand-waving level) is a block copy running at 9.1 cycles per byte. I've probably left out some overhead in this, but I think the overall analysis is OK. BugBear Edited May 8, 2018 by bugbear Quote Link to comment Share on other sites More sharing options...
ivop Posted May 8, 2018 Share Posted May 8, 2018 (edited) ldx #0 ; 2 lda $2000,x ; 4 sta $3000,x ; 4 lda $2100,x sta $3100,x lda $2200,x sta $3200,x lda $2300,x sta $3300,x lda $2400,x sta $3400,x lda $2500,x sta $3500,x lda $2600,x sta $3600,x lda $2700,x sta $3700,x lda $2800,x sta $3800,x lda $2900,x sta $3900,x lda $2a00,x sta $3a00,x lda $2b00,x sta $3b00,x lda $2c00,x sta $3c00,x lda $2d00,x sta $3d00,x lda $2e00,x sta $3e00,x lda $2f00,x sta $3f00,x inx ; 2 bne copy ; 3 ; (2 + (16*8 + 2 + 3) * 256)/4096 = 8.312988 cycles per byte This is how I would bulk copy 4kB (16 pages). Edited May 8, 2018 by ivop Quote Link to comment Share on other sites More sharing options...
Rybags Posted May 8, 2018 Share Posted May 8, 2018 (edited) Absolute fastest is just LDA SRC STA DEST LDA SRC + 1 STA DEST + 1 8 cycles per byte but an obscene memory requirement of <screen size> * 6 so about 48K for a hires or 24K for a Graphics 7 bitmap. But not so bad for a text screen, under 6K. A less costly method in terms of memory and not too much worse for cycles is using indexed addressing for every page of screen RAM. To ensure maximum efficiency, set your screen base to minimize crossing page boundaries in the move. LDX #0 COPY LDA SRC,X STA DEST,X LDA SRC+$100,X STA DEST+$100,X etc. until LDA SRC+$1E00,X STA DEST+$1E00,X INX BNE COPY Alternative to this you could use self-modifying code to make the internals of the loop smaller and just modify the operands and run the loop again. Or better alternative - given the flexibility of the Atari screen architecture you can use DLists with LMS and fine scrolling to mimimize or eliminate having to move memory around. Edited May 8, 2018 by Rybags Quote Link to comment Share on other sites More sharing options...
Irgendwer Posted May 8, 2018 Share Posted May 8, 2018 sta $3000,x ; 4 Unfortunately not (5). 1 Quote Link to comment Share on other sites More sharing options...
Heaven/TQA Posted May 8, 2018 Share Posted May 8, 2018 if I have the mem... I would do brute force .rept 4096 LDA src+# STA dst+# .endr 1 Quote Link to comment Share on other sites More sharing options...
bugbear Posted May 8, 2018 Author Share Posted May 8, 2018 >> This is how I would bulk copy 4kB (16 pages). Unrolling the outer loop instead of the inner - I love it! BugBear Quote Link to comment Share on other sites More sharing options...
Rybags Posted May 8, 2018 Share Posted May 8, 2018 In the modern day with large flashcarts and Ram sizes, entirely feasible to do enormous brute-force lumps of code rather than time consuming loops. Also worth mentioning the rendering technique exposed by Heaven in the Activision Zone Ranger game. Has a thread from several years ago somewhere around here. Quote Link to comment Share on other sites More sharing options...
ivop Posted May 8, 2018 Share Posted May 8, 2018 Unfortunately not (5). Aw, yes. Good catch. (2 + (16*9 + 2 + 3) * 256)/4096 = 9.312988 cycles per byte I just copied 4 and 4 from bugbear's post, but a abs,x store is indeed one cycle more. Quote Link to comment Share on other sites More sharing options...
bugbear Posted May 8, 2018 Author Share Posted May 8, 2018 In the modern day with large flashcarts and Ram sizes, entirely feasible to do enormous brute-force lumps of code rather than time consuming loops. In the modern age, a GPU can shift giga-bytes per second. But (for me) being full-on-retro is more fun. BugBear Quote Link to comment Share on other sites More sharing options...
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.