
Speed up your memset and memcpy with loop unrolling


bithopper


While optimizing a few C methods in my current CDFJ game project, I realized just how often I use the standard myMemcpy and myMemset functions provided in the file 'defines_cdfj.h'.

These are basic loops that work fine and have a small memory footprint. I was using them all over the place: for initialization routines, sprite graphics, etc.

 

For example:

void myMemsetInt(unsigned int* destination, int fill, int count)
{
	int i;
	for (i=0; i<count; ++i) {
		destination[i] = fill;
	}
}

 

The resulting ARM code the compiler produces looks something like:

myMemsetInt:
        mov     r3, #0
.L2:
        cmp     r3, r2
        bxge    lr
        str     r1, [r0, r3, lsl #2]
        add     r3, r3, #1
        b       .L2

 

Without going too deep into ARM coding: as a rule of thumb we can assume the ARM processor executes each line here in one cycle. (Some instructions take longer, multiplications, for example.)

So we see 5 instructions running in a loop, and only one of them actually stores a value to memory; the other 4 are loop bookkeeping. That is a pretty hefty overhead, isn't it? Maybe we can do better?

(Note: Actually the compiler will produce Thumb16 code, which looks even worse).

 

All of a sudden I remembered how you sped up screen clearing in 68000 assembler on the Atari ST back in the day, and I realized: by unrolling the loop we can get rid of much of this overhead!

 

OK, let's do it. Let's assume we want to set 8 values somewhere in memory. Then we can get rid of the loop completely by writing something like this:

 

void myMemsetInt_8(unsigned int* destination, int fill)
{
        *destination++ = fill;
        *destination++ = fill;
        *destination++ = fill;
        *destination++ = fill;
        *destination++ = fill;
        *destination++ = fill;
        *destination++ = fill;
        *destination++ = fill;
}

 

Which compiles to:

myMemsetInt_8:
        str     r1, [r0]
        str     r1, [r0, #4]
        str     r1, [r0, #8]
        str     r1, [r0, #12]
        str     r1, [r0, #16]
        str     r1, [r0, #20]
        str     r1, [r0, #24]
        str     r1, [r0, #28]
        bx      lr

 

That's better, right? Just 9 instructions to set 8 integers in memory, as opposed to roughly 5 * 8 = 40 with the simple loop.

 

OK, we did it. End of story.

 

Err, wait :)

 

What we really need is a more general solution with a reasonable memory footprint, one that can set arbitrary amounts of memory just like the standard loop, but still benefits from this unrolling principle.

 

After some thinking about how to do certain things in C, I ended up writing this piece of code:

 

void myMemsetInt_unrolled16(unsigned int* destination, int fill, int count)
{
    static const void* jmparray[] = {&&L_16,&&L_15,&&L_14,&&L_13,&&L_12,&&L_11,&&L_10,&&L_09,&&L_08,&&L_07,&&L_06,&&L_05,&&L_04,&&L_03,&&L_02,&&L_01};
    // NOTE: GCC -Os will hopefully optimize this code so that it computes the address rather than actually builds the table!

    loop:
  
    if(count < 16) goto *jmparray[count];   // Use unrolled fill for <count> Integers.

          *destination++ = fill;
    L_01: *destination++ = fill;
    L_02: *destination++ = fill;
    L_03: *destination++ = fill;
    L_04: *destination++ = fill;
    L_05: *destination++ = fill;
    L_06: *destination++ = fill;
    L_07: *destination++ = fill;
    L_08: *destination++ = fill;
    L_09: *destination++ = fill;
    L_10: *destination++ = fill;
    L_11: *destination++ = fill;
    L_12: *destination++ = fill;
    L_13: *destination++ = fill;
    L_14: *destination++ = fill;
    L_15: *destination++ = fill;
    L_16: 

    if(count <= 16) return; // DONE

    count -= 16;
    goto loop;
}

 

The compiler produced this:

myMemsetInt_unrolled16:
        ldr     r3, .L22
.L2:
        cmp     r2, #15
        ldrle   pc, [r3, r2, lsl #2]  @ indirect memory jump
        str     r1, [r0], #4
        str     r1, [r0], #4
        str     r1, [r0], #4
        str     r1, [r0], #4
        str     r1, [r0], #4
        str     r1, [r0], #4
        str     r1, [r0], #4
        str     r1, [r0], #4
        str     r1, [r0], #4
        str     r1, [r0], #4
        str     r1, [r0], #4
        str     r1, [r0], #4
        str     r1, [r0], #4
        str     r1, [r0], #4
        str     r1, [r0], #4
        str     r1, [r0], #4
        cmp     r2, #16
        bxle    lr
        sub     r2, r2, #16
        b       .L2
.L22:
        .word   .LANCHOR0

 

 

Doesn't look too bad in terms of loop overhead, right? You will notice the larger size, though. Sure, the further you unroll it, the better the overhead ratio, but also the bigger the code size.

I am very aware of the memory restrictions on the VCS, but when using bank-switching schemes like CDFJ with 32K, I guess the speed boost might well be worth the memory trade-off.

If this degree of unrolling seems excessive, you can adapt it to any other number. Maybe 8 is a good compromise between loop overhead and memory size.

The 'unrolling factor' of 16 I used in this code just gives me the good feeling that I don't waste any loop overhead on my sprites, as most of them are not taller than that :))
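
For example, a factor-8 variant could look like this (a minimal sketch following the same labels-as-values pattern; it is not code from defines_cdfj.h and I have not benchmarked it):

void myMemsetInt_unrolled8(unsigned int* destination, int fill, int count)
{
    static const void* jmparray[] = {&&L_08,&&L_07,&&L_06,&&L_05,&&L_04,&&L_03,&&L_02,&&L_01};
    // Same idea as above, just with a shorter unrolled body.

    loop:

    if(count < 8) goto *jmparray[count];   // Use unrolled fill for <count> integers.

          *destination++ = fill;
    L_01: *destination++ = fill;
    L_02: *destination++ = fill;
    L_03: *destination++ = fill;
    L_04: *destination++ = fill;
    L_05: *destination++ = fill;
    L_06: *destination++ = fill;
    L_07: *destination++ = fill;
    L_08:

    if(count <= 8) return; // DONE

    count -= 8;
    goto loop;
}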

 

This idea is also applicable to other loops, for example the byte-wise myMemcpy:

 

void myMemcpy_unrolled16(unsigned char* destination, unsigned char* source, int count)
{
    static const void* jmparray[] = {&&L_16,&&L_15,&&L_14,&&L_13,&&L_12,&&L_11,&&L_10,&&L_09,&&L_08,&&L_07,&&L_06,&&L_05,&&L_04,&&L_03,&&L_02,&&L_01};
    // NOTE: GCC -Os hopefully will optimize this code so that it computes the address rather than actually builds the table!

    loop:

    if(count < 16) goto *jmparray[count];   // Use unrolled copy for <count> bytes.

    L_00: *destination++ = *source++;
    L_01: *destination++ = *source++;
    L_02: *destination++ = *source++;
    L_03: *destination++ = *source++;
    L_04: *destination++ = *source++;
    L_05: *destination++ = *source++;
    L_06: *destination++ = *source++;
    L_07: *destination++ = *source++;
    L_08: *destination++ = *source++;
    L_09: *destination++ = *source++;
    L_10: *destination++ = *source++;
    L_11: *destination++ = *source++;
    L_12: *destination++ = *source++;
    L_13: *destination++ = *source++;
    L_14: *destination++ = *source++;
    L_15: *destination++ = *source++;
    L_16:

    if(count <= 16) return; // DONE

    count -= 16;
    goto loop;
}

 

 

As mentioned in the code comments: the jump tables are only provided to let the compiler's optimizer know about our intentions. If the optimizer does its job, these tables do not appear in the result. The algorithm is optimized for small sizes, i.e. it is designed so that, compared to the simple loop, it has almost no overhead when setting just a single value. From the second integer on you get the sheer power of unrolling.
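
For illustration, typical calls look like this (a minimal usage sketch; the buffer and sprite names are hypothetical and not part of defines_cdfj.h):

unsigned int  workBuffer[40];          // some word-sized work buffer
unsigned char spriteBuffer[16];        // destination for sprite graphics
unsigned char spriteGfx[16];           // sprite data defined elsewhere

void prepareFrame(void)
{
    myMemsetInt_unrolled16(workBuffer, 0, 40);          // clear 40 integers
    myMemcpy_unrolled16(spriteBuffer, spriteGfx, 16);   // copy 16 bytes
}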

 

When I showed the optimized copy and set methods to Darrell, he encouraged me to write this post. Maybe someone else finds it useful as well - and now it's time to copy some more sprites :)

 

 

>> UPDATE: For some custom native ARM Thumb16 implementations of myMemcpy and myMemsetInt, please look further down <<

 


 


A word about the algorithm and its C implementation: as mentioned, while the technique is well known to some, its implementation in C proved a bit of a challenge.

I tried different approaches and rejected all but the one presented here, for various reasons.

 

Generally speaking, the ultimate goal was C code that compiles to perfect ARM code, meaning you could not find a better solution when writing directly in ARM assembler.

At least, we would want to come reasonably close.

 

This means we must write some C code that nudges the C compiler's optimizer in the right direction.

 

The best results I got were with a non-standard C feature called 'labels as values', which is common to GCC and other compilers. It allows for computed jumps and is used in the presented methods. Other approaches in pure standard C, utilizing the 'switch' statement as in Duff's Device (which I found afterwards on Wikipedia), produced considerably worse compiled code, at least in my tests.
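
For reference, the switch-based approach looks roughly like this (a minimal sketch of the classic Duff's Device idea, not the exact code I tested):

void myMemsetInt_duff(unsigned int* destination, int fill, int count)
{
    int n;

    // Classic Duff's Device: the switch jumps into the middle of the
    // unrolled body to handle the remainder; the do/while then continues
    // in blocks of 8.
    if (count <= 0) return;

    n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *destination++ = fill;
    case 7:      *destination++ = fill;
    case 6:      *destination++ = fill;
    case 5:      *destination++ = fill;
    case 4:      *destination++ = fill;
    case 3:      *destination++ = fill;
    case 2:      *destination++ = fill;
    case 1:      *destination++ = fill;
            } while (--n > 0);
    }
}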

 

It would be interesting to see whether some ARM code, written natively in assembler, could still be faster somehow. Please let me know!

 


Thanks, Thomas! Loading/storing multiple registers sure is a good way to speed things up further. If I remember correctly, this was possible even on the 68000.

Do you have any idea how to 'persuade' the compiler's optimizer to make use of them? Up to now, the code is platform independent...


Related to this: for Lode Runner 2600 my biggest challenge was to bring down the number of bytes of the ARM binary, so I preferred size over speed.

I also experimented with the functions in defines_cdfj.h and found that the following C code for myMemsetInt results in 8 bytes less for the generated ARM binary:

 

void myMemsetInt(unsigned int *destination, int fill, int count) 
{
    while (count > 0) destination[--count] = fill; 
}

 

...resulting in the ARM assembly code below. For some reason it looks more verbose than the original ARM assembly code for myMemsetInt, but it should use 8 bytes less in the ARM binary:

 

myMemsetInt:
.L1:
	cmp    r2, #0
	bgt    .L2
	bx     lr
.L2:
	subs   r2, r2, #1
	lsls   r3, r2, #2
	str    r1, [r0, r3]
	b     .L1

 


Spent another couple of hours on the byte-wise memcpy, as the result of the C compiler's optimization still left something to be desired.

After several frustrating attempts I took a deep breath and decided to write a custom memcpy in native Thumb16 code.

Maybe it is also of use to you, fellow hero (or heroine?!) of the VCS.

 

I reduced the unrolling to 8, now featuring two phases: a faster 'block' copy (although still with single-value copy instructions, as the multiple-register load/store instructions appear to work only with whole words), followed by 0 to 7 final single-byte copies.

 

As the overall size is ~108 bytes, this implementation will not be suited for every VCS project (most certainly not for Dionoid's Lode Runner :-D ).

It is still optimized for small copy sizes, though; I aimed for a low overhead when copying just a few bytes. When copying more than a few dozen bytes, we could use another memcpy implementation that aligns the source and destination pointers and then ends up in a super fast integer copy, as sketched below. These algorithms are way faster for big copy sizes, however they tend to be even bulkier, with a sobering overhead ratio for the first few bytes (how big are your sprites, again?)
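
For illustration, such an aligning memcpy could look roughly like this in plain C (a minimal, untested sketch of the general idea only; it is not code from this thread, not tuned for the VCS, and it assumes 32-bit pointers):

void myMemcpy_aligning(unsigned char* destination, unsigned char* source, int count)
{
    // Copy single bytes until the destination reaches a 4-byte boundary.
    while (count > 0 && ((unsigned int)destination & 3) != 0) {
        *destination++ = *source++;
        --count;
    }

    // If the source now shares that alignment, copy whole 32-bit words.
    if (((unsigned int)source & 3) == 0) {
        unsigned int* d = (unsigned int*)destination;
        unsigned int* s = (unsigned int*)source;
        while (count >= 4) {
            *d++ = *s++;
            count -= 4;
        }
        destination = (unsigned char*)d;
        source = (unsigned char*)s;
    }

    // Copy the remaining tail bytes one by one.
    while (count-- > 0) *destination++ = *source++;
}

But back to the small-copy case; here is the custom Thumb16 implementation: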

 

 


void myMemcpy_br8(unsigned char* destination, unsigned char* source, int count)
{
  // (C) 2023 oliver gross aka bithopper - use at your own RISC

  // V0.9.230215_1

  asm volatile(
        "cmp     r2, #7\n\t"
        "ble     .LL2\n\t"
        "sub    r2, #8\n\t"
    ".LL3:\n\t"
        "ldrb    r3, [r1]\n\t"
        "strb    r3, [r0]\n\t"
        "ldrb    r3, [r1, #1]\n\t"
        "strb    r3, [r0, #1]\n\t"
        "ldrb    r3, [r1, #2]\n\t"
        "strb    r3, [r0, #2]\n\t"
        "ldrb    r3, [r1, #3]\n\t"
        "strb    r3, [r0, #3]\n\t"
        "ldrb    r3, [r1, #4]\n\t"
        "strb    r3, [r0, #4]\n\t"
        "ldrb    r3, [r1, #5]\n\t"
        "strb    r3, [r0, #5]\n\t"
        "ldrb    r3, [r1, #6]\n\t"
        "strb    r3, [r0, #6]\n\t"
        "ldrb    r3, [r1, #7]\n\t"
        "strb    r3, [r0, #7]\n\t"
        "add     r0, #8\n\t"
        "add     r1, #8\n\t"
        "sub     r2, #8\n\t"
        "bpl     .LL3\n\t"
        "add    r2, #8\n\t"
    ".LL2:\n\t"
        "lsl     r2, r2, #3\n\t"
        "mov     r3, #55\n\t"
        "sub     r3, r2\n\t"
        "add     pc, r3\n\t"
        "ldrb    r3, [r1]\n\t"
        "add     r1, #1\n\t"
        "strb    r3, [r0]\n\t"
        "add     r0, #1\n\t"
        "ldrb    r3, [r1]\n\t"
        "add     r1, #1\n\t"
        "strb    r3, [r0]\n\t"
        "add     r0, #1\n\t"
        "ldrb    r3, [r1]\n\t"
        "add     r1, #1\n\t"
        "strb    r3, [r0]\n\t"
        "add     r0, #1\n\t"
        "ldrb    r3, [r1]\n\t"
        "add     r1, #1\n\t"
        "strb    r3, [r0]\n\t"
        "add     r0, #1\n\t"
        "ldrb    r3, [r1]\n\t"
        "add     r1, #1\n\t"
        "strb    r3, [r0]\n\t"
        "add     r0, #1\n\t"
        "ldrb    r3, [r1]\n\t"
        "add     r1, #1\n\t"
        "strb    r3, [r0]\n\t"
        "add     r0, #1\n\t"
        "ldrb    r3, [r1]\n\t"
        "add     r1, #1\n\t"
        "strb    r3, [r0]\n\t"
        "add     r0, #1\n\t"
        : : : "r3", "cc", "memory"
  );
}

 

A word of caution: although it performed nicely in my WIP, using GCC, it is not thoroughly tested, so please regard this as highly experimental code that might still see some updates in the near future. (Did I mention it is now highly platform dependent?)

 

If someone uses this method and finds a bug, or has some other suggestions about the ARM code, please let me know!


Here is the corresponding optimized custom memsetInt in native assembler code. The same principle applies as with the assembler memcpy in my last post: the memory is set blockwise, 8 integers (i.e. 32-bit words) per loop pass, using well-suited special ARM instructions (thanks, Thomas), up to the point where the remaining number of integers/words is below the block size. Those are then handled without another loop. Once again the aim was a reasonably low overhead for small(ish) amounts of memory to set. Nevertheless, when filling/clearing larger buffers, the method still performs well. Its size is ~48 bytes of ARM Thumb code. Remember: the destination pointer needs to be 4-byte aligned.

 

void myMemsetInt_br8(unsigned int* destination, unsigned int fill, int count)
{
  // (C) 2023 oliver gross aka bithopper - use at your own RISC

  // V0.9.230219_0
  
  asm volatile(
        "cmp     r2, #8\n\t"
        "blt     .LL02\n\t"
        "push   {r4-r6}\n\t"
        "mov    r4, r1\n\t"
        "mov    r5, r1\n\t"
        "mov    r6, r1\n\t"
        "sub    r2, #8\n\t"
    ".LL01:\n\t"
        "stmia  r0!, {r1, r4-r6}\n\t"
        "stmia  r0!, {r1, r4-r6}\n\t"
        "sub    r2, #8\n\t"
        "bpl    .LL01\n\t"
        "add    r2, #8\n\t"
        "pop    {r4-r6}\n\t"
    ".LL02:\n\t"
        "lsl     r2, r2, #1\n\t"
        "mov     r3, #13\n\t"
        "sub     r3, r2\n\t"
        "add     pc, r3\n\t"
        "stmia   r0!, {r1}\n\t"
        "stmia   r0!, {r1}\n\t"
        "stmia   r0!, {r1}\n\t"
        "stmia   r0!, {r1}\n\t"
        "stmia   r0!, {r1}\n\t"
        "stmia   r0!, {r1}\n\t"
        "stmia   r0!, {r1}\n\t"
        : : : "r3", "cc", "memory"
  );
}
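
For illustration, a minimal usage sketch (the buffer name is hypothetical; declaring it as an 'unsigned int' array gives the 4-byte alignment the routine expects):

unsigned int screenBuffer[192];   // naturally 4-byte aligned

void clearScreenBuffer(void)
{
    // Clear 192 words; the block loop handles 8 words per pass and
    // the computed jump finishes the remainder.
    myMemsetInt_br8(screenBuffer, 0, 192);
}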

 

 

The code seems to run fine in my WIP, compiled with GCC, so I decided to post it here. However: this is a beta version!

You are welcome to use it, but be warned: it is not thoroughly tested and there might be bug fixes and other updates in the near future.

 

Please let me know of any issues or suggestions, feedback is most welcome 🙂

 

 

 


I got some questions about how much is actually gained by using the presented methods. As I had no detailed answer to that, only some promising observations in my game effort, here are some calculations based on a data sheet with cycle timings (from ARM, also attached).

 

This might not completely replace actual measurements on real hardware, but it is probably not too far off and should shed some light on it.

 

The attached PDF analyzes the cycle counts for four relevant variants discussed in this topic and compares them, also graphically. It was done in Excel and I hope the cycle calculations are correct - you might do a countercheck and tell me where I miscounted...

(In case you are wondering: the cycle count shown is calculated for a non-inlined function, so it includes the cycles for the method call and its return.)

 

In case the PDF doesn't speak for itself and needs some explanation, please let me know as well.

 

 


 

 

 

 

myMemsetInt_CycleComparison_4Variants.pdf


The gist: myMemsetInt_br8 always runs faster, right from count 1, and its average speed quickly increases with the size to fill, until it reaches almost 7 times (6.85) that of the original loop and still nearly 6 times (5.7) that of the fastest single loop in this comparison.

 

To be exact: setting one integer (count 1) takes 19 cycles (including the call overhead for non-inlined code), roughly the same time as most variants. The original is slower at the beginning, as it has some initialization instructions; it needs 32 cycles. At count 8 a small setback occurs, reducing the average speed, but the power of the block loop kicks in now and makes up for it very quickly (see data series and chart).

 

Important here: we are never slower than any of the simple loops, even in this 'fast block preparation phase'!

 

Some numbers: filling 7 integers takes less than half the time of the fastest single loop, 17 integers a third, 40 a quarter, 120 a fifth, and filling/clearing 1000 integers takes only a bit over a sixth (5.6) of the time, at a lean 1.8 cycles per filled integer.

 

So much for the theory - now let's see it in practice.

 

[attached image]

 

 


Interestingly, Dionoid's size-optimized code is not only smaller, but also performs slightly better than the original. The original has some register-saving overhead; for some reason it needs one register more than the other simple-loop memsets.

 

When I played around with Thumb16 code, I found that incrementing the pointer instead of indexing often leads to slightly better loop results, at least with the compiler flags and options used. Here, the optimizer even found and used a special ARM instruction (STMIA).

 

This is why the alternate size-optimized single-loop code presented is another two bytes smaller (12 bytes) and also slightly faster.
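
For illustration, such a pointer-incrementing variant could look like this (a minimal sketch of the idea only, not necessarily the exact code behind the numbers above):

// Advance the pointer instead of indexing, which the optimizer can turn
// into a post-incrementing store such as STMIA.
void myMemsetInt_ptr(unsigned int *destination, unsigned int fill, int count)
{
    while (count-- > 0) {
        *destination++ = fill;
    }
}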

 
