Over-engineering Cyclic Array Rotation In C#

Rotating an array in an efficient way is a classical programming exercise.

It’s also one that’s ripe for over-engineering for funsies.

This article describes five approaches for rotating an array and ranks their performance against each other.

The Problem

We have some array of integers of length N, from zero to infinity and beyond.

We must rotate that array some K number of times.

If K is 3 then every element in the array will move three positions to the right, except the last three elements, which will move to the beginning.

For example, if we start with this array…

[0] [1] [2] [3] [4] [5]

…and rotate it by 3, we must end up with…

[3] [4] [5] [0] [1] [2]

Constraints

No in-place rotation - return a new array.
N can be of arbitrary size or even zero.
K can be of arbitrary size or zero but not negative - only rotate right.
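
As a rough sketch, the constraints above can be written down as a handful of executable expectations. The RotationChecks class and its Verify method are hypothetical names introduced here for illustration, not part of the original code. Verify accepts any rotation function with the same shape as the methods in this article.

```csharp
using System;

public static class RotationChecks
{
    // Hypothetical helper: throws if the given rotation function breaks a constraint.
    public static void Verify(Func<int[], int, int[]> rotate)
    {
        var input = new[] { 0, 1, 2, 3, 4, 5 };

        // rotating right by 3 moves the last three elements to the front
        Expect(rotate(input, 3), "3,4,5,0,1,2");

        // K = 0 and K = N both leave the order unchanged
        Expect(rotate(input, 0), "0,1,2,3,4,5");
        Expect(rotate(input, 6), "0,1,2,3,4,5");

        // K larger than N wraps around: 8 behaves like 2
        Expect(rotate(input, 8), "4,5,0,1,2,3");

        // an empty array stays empty
        Expect(rotate(new int[0], 3), "");

        // no in-place rotation: the input is untouched and a new array comes back
        Expect(input, "0,1,2,3,4,5");
        if (ReferenceEquals(rotate(input, 1), input))
            throw new InvalidOperationException("Expected a new array.");
    }

    private static void Expect(int[] actual, string expected)
    {
        var text = string.Join(",", actual);
        if (text != expected)
            throw new InvalidOperationException($"Expected [{expected}] but got [{text}].");
    }
}
```

Any of the rotation methods in this article can be passed in as a method group, for example RotationChecks.Verify(RotateByRemainderIndexing).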

A Naive Solution

There isn’t really one for this exercise. Even brute-force in-place rotation does not make sense as we need to create a new array anyway.

An Efficient Solution

An obvious solution to this exercise is to move all items to the new array by:

Calculating old index + distance
Getting its remainder over the array length to account for overflow.

public static int[] RotateByRemainderIndexing(int[] input, int distance)
{
    // validate
    if (input == null) throw new ArgumentNullException(nameof(input));
    if (distance < 0) throw new ArgumentOutOfRangeException(nameof(distance));
    if (input.Length == 0) return new int[0];

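    // rotate
    var result = new int[input.Length];
    // reduce the distance first so that large distances do not overflow i + offset
    int offset = distance % input.Length;
    for (int i = 0; i < input.Length; ++i)
    {
        int j = (i + offset) % input.Length;
        result[j] = input[i];
    }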
    return result;
}

This is already fit for purpose.

It’s a simple O(n) approach, easy to understand and scales in a linear fashion.

That said, we are still iterating the array and performing arithmetic on every single step. Yet when you think about it, all we are doing is swapping two array segments, nothing more.

Is there a way to avoid this extra legwork and go straight to copying memory?

Well, it turns out there are a number of ways to do just that.

A More Efficient Solution? Array.Copy

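As a more efficient approach, we can split the array into two segments and copy each one to its rotated position in the new array. With diff = distance % input.Length, the first input.Length - diff elements move to offset diff, and the last diff elements move to the start.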

public static int[] RotateByArrayCopy(int[] input, int distance)
{
    // validate
    if (input == null) throw new ArgumentNullException(nameof(input));
    if (distance < 0) throw new ArgumentOutOfRangeException(nameof(distance));
    if (input.Length == 0) return new int[0];

    // rotate
    var result = new int[input.Length];
    int diff = distance % input.Length;
    Array.Copy(input, 0, result, diff, input.Length - diff);
    Array.Copy(input, input.Length - diff, result, 0, diff);
    return result;
}

This solution does away with the loop in favour of two Array.Copy calls, each of which copies a whole segment of the underlying array data in one go.

Array.Copy is general-use and works for both value and reference types, performing boxing, unboxing and casting as required.
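
Because Array.Copy is not tied to int, the same two-copy approach could be made generic. The following is a minimal sketch under that assumption. GenericRotation and its RotateByArrayCopy<T> method are illustrative names, not code from the benchmarked repository, and this variant is not part of the measurements in the Performance section.

```csharp
using System;

public static class GenericRotation
{
    // Illustrative generic variant of the Array.Copy approach.
    public static T[] RotateByArrayCopy<T>(T[] input, int distance)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));
        if (distance < 0) throw new ArgumentOutOfRangeException(nameof(distance));
        if (input.Length == 0) return new T[0];

        var result = new T[input.Length];
        int diff = distance % input.Length;

        // the first (Length - diff) elements shift right by diff positions
        Array.Copy(input, 0, result, diff, input.Length - diff);
        // the last diff elements wrap around to the start
        Array.Copy(input, input.Length - diff, result, 0, diff);
        return result;
    }
}
```

For example, GenericRotation.RotateByArrayCopy(new[] { "a", "b", "c", "d" }, 1) yields d, a, b, c.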

A More Efficient Solution? Buffer.BlockCopy

Another way of copying underlying memory is with Buffer.BlockCopy.

public static int[] RotateByBufferCopy(int[] input, int distance)
{
    // validate
    if (input == null) throw new ArgumentNullException(nameof(input));
    if (distance < 0) throw new ArgumentOutOfRangeException(nameof(distance));
    if (input.Length == 0) return new int[0];

    // rotate
    var size = sizeof(int);
    var result = new int[input.Length];
    int diff = distance % input.Length;
    Buffer.BlockCopy(input, 0, result, diff * size, (input.Length - diff) * size);
    Buffer.BlockCopy(input, (input.Length - diff) * size, result, 0, diff * size);
    return result;
}

Buffer.BlockCopy is a bit more low-level than Array.Copy and will copy the underlying bytes without regard to the type they represent.

Because of that, we must include the type size in the slice calculations.

There is also a Buffer.MemoryCopy method that takes memory pointers but as it is not CLS-compliant, I have not included it in this comparison.
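
For the curious, here is roughly what a Buffer.MemoryCopy variant might look like. This is only a sketch: MemoryCopyRotation and RotateByMemoryCopy are made-up names, the method was not benchmarked, and it needs the assembly to be compiled with unsafe code allowed.

```csharp
using System;

public static class MemoryCopyRotation
{
    // Illustrative only: not part of the benchmarks in this article.
    public static unsafe int[] RotateByMemoryCopy(int[] input, int distance)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));
        if (distance < 0) throw new ArgumentOutOfRangeException(nameof(distance));
        if (input.Length == 0) return new int[0];

        var result = new int[input.Length];
        int diff = distance % input.Length;
        long headBytes = (long)(input.Length - diff) * sizeof(int);
        long tailBytes = (long)diff * sizeof(int);

        // pin both arrays so the garbage collector cannot move them mid-copy
        fixed (int* source = input)
        fixed (int* target = result)
        {
            // the first segment shifts right by diff elements
            Buffer.MemoryCopy(source, target + diff, headBytes, headBytes);
            // the last diff elements wrap around to the start
            Buffer.MemoryCopy(source + (input.Length - diff), target, tailBytes, tailBytes);
        }
        return result;
    }
}
```

Pointer arithmetic on int* moves in whole elements, so only the byte counts need the sizeof(int) factor.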

A More Efficient Solution? Span.Slice.CopyTo

Yet another way to slice and copy an array is to use a Span<T>.

public static int[] RotateBySpanCopy(int[] input, int distance)
{
    // validate
    if (input == null) throw new ArgumentNullException(nameof(input));
    if (distance < 0) throw new ArgumentOutOfRangeException(nameof(distance));
    if (input.Length == 0) return new int[0];

    // rotate
    var result = new int[input.Length];
    var target = new Span<int>(result);
    var diff = distance % input.Length;
    var source1 = new Span<int>(input, 0, input.Length - diff);
    source1.CopyTo(target.Slice(diff, input.Length - diff));
    var source2 = new Span<int>(input, input.Length - diff, diff);
    source2.CopyTo(target.Slice(0, diff));

    return result;
}

Span<T> lets us work with contiguous segments of memory and enables neat stuff like in-place casting without boxing and slicing without allocation.

This algorithm uses that exact slicing ability to copy the underlying memory from the two original segments to the two target segments.

Span<T> knows the type of element it represents (that’s what the <T> is for), so we do not need to include the type size in the slice calculations.
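
The same idea can be written a little more compactly with the AsSpan extension method, and since a span carries its element type it could also be made generic. A minimal sketch, assuming a hypothetical SpanRotation.Rotate<T> helper that takes a ReadOnlySpan<T>. It is not one of the benchmarked methods.

```csharp
using System;

public static class SpanRotation
{
    // Hypothetical generic helper built on span slicing.
    public static T[] Rotate<T>(ReadOnlySpan<T> input, int distance)
    {
        if (distance < 0) throw new ArgumentOutOfRangeException(nameof(distance));
        if (input.Length == 0) return new T[0];

        var result = new T[input.Length];
        int diff = distance % input.Length;
        int split = input.Length - diff;

        // elements before the split point shift right by diff positions
        input.Slice(0, split).CopyTo(result.AsSpan(diff));
        // elements from the split point onwards wrap around to the start
        input.Slice(split).CopyTo(result.AsSpan(0, diff));
        return result;
    }
}
```

When passing an array, spell out the type argument so the array converts to a span, as in SpanRotation.Rotate<int>(numbers, 3), where numbers is any int[].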

A More Efficient Solution? Unsafe.CopyBlock

The final contender in this over-engineering race is the new(ish) Unsafe.CopyBlock from the System.Runtime.CompilerServices (aka “shush, I know what I’m doing”) namespace.

public unsafe static int[] RotateByUnsafeCopy(int[] input, int distance)
{
    // validate
    if (input == null) throw new ArgumentNullException(nameof(input));
    if (distance < 0) throw new ArgumentOutOfRangeException(nameof(distance));
    if (input.Length == 0) return new int[0];

    // prepare to rotate
    var result = new int[input.Length];
    int size = Unsafe.SizeOf<int>();
    int diff = distance % input.Length;

    // pin memory locations
    fixed (int* target1 = result.AsSpan(diff, input.Length - diff))
    fixed (int* slice1 = input.AsSpan(0, input.Length - diff))
    fixed (int* target2 = result.AsSpan(0, diff))
    fixed (int* slice2 = input.AsSpan(input.Length - diff, diff))
    {
        // copy the underlying memory in the array
        Unsafe.CopyBlock(target1, slice1, (uint)((input.Length - diff) * size));
        Unsafe.CopyBlock(target2, slice2, (uint)(diff * size));
    }

    return result;
}

Unsafe.CopyBlock is as brute-force as it gets.

It will copy memory from one place to the other without regard to the runtime shuffling memory as it goes. That’s why we need to pin any memory we need to touch, lest the runtime sweep it out from under our algorithm’s feet.

To use it, we need to mark the method as unsafe and build the assembly with /unsafe turned on, just to prove how desperate we are.

Performance

Here is how these algorithms stack up against each other…

Method	N	Mean	Error	StdDev	Ratio	RatioSD	Rank	Gen 0/1k Op	Gen 1/1k Op	Gen 2/1k Op	Allocated Memory/Op
RotateByRemainderIndexing	10	55.32 ns	1.2551 ns	1.6320 ns	1.00	0.00	**	0.0203	-	-	64 B
RotateByArrayCopy	10	54.73 ns	1.4390 ns	1.9698 ns	0.99	0.05	**	0.0203	-	-	64 B
RotateByBufferCopy	10	46.79 ns	0.9520 ns	0.8905 ns	0.84	0.03	***	0.0203	-	-	64 B
RotateBySpanCopy	10	44.12 ns	0.7237 ns	0.6416 ns	0.80	0.02	**	0.0203	-	-	64 B
RotateByUnsafeCopy	10	40.55 ns	0.5346 ns	0.5001 ns	0.73	0.02	*	0.0203	-	-	64 B

RotateByRemainderIndexing	100	490.21 ns	9.9104 ns	16.2831 ns	1.00	0.00	**	0.1345	-	-	424 B
RotateByArrayCopy	100	116.81 ns	2.5840 ns	3.0760 ns	0.24	0.01	***	0.1347	-	-	424 B
RotateByBufferCopy	100	97.67 ns	1.9820 ns	2.2824 ns	0.20	0.01	**	0.1347	-	-	424 B
RotateBySpanCopy	100	94.22 ns	1.9825 ns	2.2830 ns	0.19	0.01	*	0.1347	-	-	424 B
RotateByUnsafeCopy	100	92.16 ns	1.9450 ns	2.5965 ns	0.19	0.01	*	0.1347	-	-	424 B

RotateByRemainderIndexing	1000	4,674.12 ns	81.3798 ns	72.1410 ns	1.00	0.00	**	1.2741	-	-	4024 B
RotateByArrayCopy	1000	613.15 ns	10.8852 ns	9.6494 ns	0.13	0.00	*	1.2779	-	-	4024 B
RotateByBufferCopy	1000	604.01 ns	11.9966 ns	18.3201 ns	0.13	0.01	*	1.2779	-	-	4024 B
RotateBySpanCopy	1000	625.79 ns	12.2431 ns	10.8532 ns	0.13	0.00	*	1.2779	-	-	4024 B
RotateByUnsafeCopy	1000	599.51 ns	14.3827 ns	13.4535 ns	0.13	0.00	*	1.2779	-	-	4024 B

RotateByRemainderIndexing	10000	47,102.64 ns	1,061.3413 ns	1,993.4563 ns	1.00	0.00	**	12.6343	-	-	40024 B
RotateByArrayCopy	10000	6,383.94 ns	47.8780 ns	42.4426 ns	0.14	0.01	*	12.6572	-	-	40024 B
RotateByBufferCopy	10000	6,470.84 ns	127.7785 ns	119.5241 ns	0.14	0.01	*	12.6572	-	-	40024 B
RotateBySpanCopy	10000	6,677.21 ns	128.7819 ns	162.8678 ns	0.14	0.01	*	12.6572	-	-	40024 B
RotateByUnsafeCopy	10000	6,547.83 ns	126.1428 ns	172.6656 ns	0.14	0.01	*	12.6572	-	-	40024 B

RotateByRemainderIndexing	100000	461,920.31 ns	6,744.3065 ns	6,308.6284 ns	1.00	0.00	***	124.5117	124.5117	124.5117	400024 B
RotateByArrayCopy	100000	73,566.10 ns	1,464.5060 ns	1,686.5273 ns	0.16	0.00	**	124.8779	124.8779	124.8779	400024 B
RotateByBufferCopy	100000	73,716.92 ns	1,468.5753 ns	2,286.3948 ns	0.16	0.01	**	124.8779	124.8779	124.8779	400024 B
RotateBySpanCopy	100000	71,036.89 ns	802.9858 ns	626.9185 ns	0.15	0.00	*	124.8779	124.8779	124.8779	400024 B
RotateByUnsafeCopy	100000	73,232.43 ns	1,416.7108 ns	1,325.1922 ns	0.16	0.00	**	124.8779	124.8779	124.8779	400024 B

RotateByRemainderIndexing	1000000	5,715,133.83 ns	52,297.2313 ns	48,918.8626 ns	1.00	0.00	**	273.4375	273.4375	273.4375	4000024 B
RotateByArrayCopy	1000000	3,471,175.86 ns	40,279.0739 ns	35,706.3499 ns	0.61	0.01	*	152.3438	152.3438	152.3438	4000024 B
RotateByBufferCopy	1000000	3,484,901.70 ns	75,821.6356 ns	77,863.2374 ns	0.61	0.01	*	152.3438	152.3438	152.3438	4000024 B
RotateBySpanCopy	1000000	3,395,618.79 ns	67,417.6361 ns	184,554.6909 ns	0.57	0.07	*	152.3438	152.3438	152.3438	4000024 B
RotateByUnsafeCopy	1000000	3,492,599.63 ns	52,870.8936 ns	46,868.6701 ns	0.61	0.01	*	152.3438	152.3438	152.3438	4000024 B

RotateByRemainderIndexing	10000000	74,122,529.83 ns	1,659,361.8996 ns	2,037,845.4327 ns	1.00	0.00	**	111.1111	111.1111	111.1111	40000024 B
RotateByArrayCopy	10000000	36,266,718.27 ns	386,807.3179 ns	361,819.8052 ns	0.49	0.01	***	133.3333	133.3333	133.3333	40000024 B
RotateByBufferCopy	10000000	33,913,202.44 ns	349,838.7446 ns	327,239.3788 ns	0.46	0.01	*	125.0000	125.0000	125.0000	40000024 B
RotateBySpanCopy	10000000	35,197,233.69 ns	678,249.4610 ns	807,407.7360 ns	0.48	0.02	**	133.3333	133.3333	133.3333	40000024 B
RotateByUnsafeCopy	10000000	36,101,336.10 ns	719,881.3796 ns	739,265.1758 ns	0.49	0.02	***	133.3333	133.3333	133.3333	40000024 B

RotateByRemainderIndexing	100000000	650,980,738.30 ns	7,266,152.6154 ns	6,796,763.6647 ns	1.00	0.00	***	-	-	-	400000024 B
RotateByArrayCopy	100000000	334,371,656.80 ns	6,517,933.2712 ns	11,243,114.6469 ns	0.52	0.01	**	-	-	-	400000024 B
RotateByBufferCopy	100000000	325,036,779.43 ns	7,624,095.8569 ns	8,157,697.1495 ns	0.50	0.01	*	-	-	-	400000024 B
RotateBySpanCopy	100000000	318,656,163.20 ns	2,968,558.6617 ns	2,776,791.6140 ns	0.49	0.00	*	-	-	-	400000024 B
RotateByUnsafeCopy	100000000	319,322,045.17 ns	2,789,657.9564 ns	2,609,447.7833 ns	0.49	0.00	*	-	-	-	400000024 B

Here are some interesting bits from this benchmark:

Just about anything is better than remainder indexing.
From 100 up to 100k elements, the memory copy methods take roughly 4 to 8 times less time than remainder indexing (ratios between 0.13 and 0.24). At 10 elements the difference is small.
That advantage shrinks at around the 1M array size, where the ratios rise to about 0.6. The exact point may be specific to my own computer spec.
From 10M onwards, copy mean time appears to stabilize at around half of remainder indexing. Again, the exact point may be specific to my own computer spec.
The garbage collection columns go blank once we reach the 100M array size. That may be related to these arrays living in the large object heap, which is collected as part of GC Generation 2, although arrays of 85,000 bytes or more already land there, so the rows from 100k upwards are affected too.
None of the memory copy methods stands out in the long run - they all behave the same.
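
The large object heap remark can be checked with a small experiment. This is only a sketch: LohProbe is a made-up name, and it relies on the documented 85,000-byte threshold for large objects, which the runtime reports as belonging to Generation 2.

```csharp
using System;

public static class LohProbe
{
    // Returns the GC generation the runtime reports for a freshly allocated int array.
    public static int GenerationOfNewArray(int length)
    {
        var array = new int[length];
        return GC.GetGeneration(array);
    }
}
```

An array of 100,000 ints takes about 400,000 bytes, well above the threshold, so LohProbe.GenerationOfNewArray(100_000) should report Generation 2, while a 100-element array normally starts out in Generation 0.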

In this case, it does not make sense to create a hybrid algorithm… But it does make sense to use one of the copy methods - which one wins, I’ll leave it up to you.

Takeaway

Sometimes one can benefit from over-engineering… Sometimes not… Sometimes both!

None of the fancy memory copy methods was significantly more performant than trusty old Array.Copy. Yet even Array.Copy by itself provided significant performance benefits over a naive remainder indexing approach. Why index what you can clone?

Resources

You can find all the code for this post in the Quicker repository, including all the benchmarks.
