Over-engineering Cyclic Array Rotation In C#

Over-engineering Cyclic Array Rotation In C#

Rotating an array in an efficient way is a classical programming exercise.

It’s also one that’s ripe for over-engineering for funsies.

This article describes five approaches for rotating an array and ranks their performance against each other.

The Problem

We have some array of integers of length N, from zero to infinity and beyond.

We must rotate that array some K number of times.

If K is 3 then every element in the array will move three positions to the right, except the last three elements, which will move to the beginning.

For example, if we start with this array…

[0] [1] [2] [3] [4] [5]

…and rotate it by 3, we must end up with…

[3] [4] [5] [0] [1] [2]

Constraints

  • No in-place rotation - return a new array.
  • N can be of arbitrary size or even zero.
  • K can be of arbitrary size or zero but not negative - only rotate right.

A Naive Solution

There isn’t really one for this exercise. Even brute-force in-place rotation does not make sense as we need to create a new array anyway.

An Efficient Solution

An obvious solution to this exercise is to move all items to the new array by:

  • Calculating old index + distance
  • Getting its remainder over the array length to account for overflow.
public static int[] RotateByRemainderIndexing(int[] input, int distance)
{
    // validate
    if (input == null) throw new ArgumentNullException(nameof(input));
    if (distance < 0) throw new ArgumentOutOfRangeException(nameof(distance));
    if (input.Length == 0) return new int[0];

    // rotate
    var result = new int[input.Length];
    for (int i = 0; i < input.Length; ++i)
    {
        int j = (i + distance) % input.Length;
        result[j] = input[i];
    }
    return result;
}

This is already fit for purpose.

It’s a simple O(n) approach, easy to understand and scales in a linear fashion.

That said, we are still iterating the array and performing arithmetic on every single step. Yet when you think about it, all we are doing is swapping two array segments, nothing more.

Is there a way to avoid this extra leg work and go straight to copying memory?

Well, it turns out there are a number of ways to do just that.

A More Efficient Solution? Array.Copy

As a more efficient approach, we can divide an array into two segments at distance % input.Length and then copy over those segments to the new array.

public static int[] RotateByArrayCopy(int[] input, int distance)
{
    // validate
    if (input == null) throw new ArgumentNullException(nameof(input));
    if (distance < 0) throw new ArgumentOutOfRangeException(nameof(distance));
    if (input.Length == 0) return new int[0];

    // rotate
    var result = new int[input.Length];
    int diff = distance % input.Length;
    Array.Copy(input, 0, result, diff, input.Length - diff);
    Array.Copy(input, input.Length - diff, result, 0, diff);
    return result;
}

This solution does away with iterations in favour of Array.Copy to copy the underlying array data in one go.

Array.Copy is general-use and works for both value and reference types, performing boxing, unboxing and casting as required.

A More Efficient Solution? Buffer.BlockCopy

Another way of copying underlying memory is with Buffer.BlockCopy.

public static int[] RotateByBufferCopy(int[] input, int distance)
{
    // validate
    if (input == null) throw new ArgumentNullException(nameof(input));
    if (distance < 0) throw new ArgumentOutOfRangeException(nameof(distance));
    if (input.Length == 0) return new int[0];

    // rotate
    var size = sizeof(int);
    var result = new int[input.Length];
    int diff = distance % input.Length;
    Buffer.BlockCopy(input, 0, result, diff * size, (input.Length - diff) * size);
    Buffer.BlockCopy(input, (input.Length - diff) * size, result, 0, diff * size);
    return result;
}

Buffer.BlockCopy is a bit more low-level than Array.Copy and will copy the underlying bytes without regards to the type they represent.

Because of that, we must include the type size in the slice calculations.

There is also a Buffer.MemoryCopy method that takes memory pointers but as it is not CLS-compliant, I have not included it in this comparison.

A More Efficient Solution? Span.Slice.CopyTo

Yet another way to slice and copy an array is to use a Span<T>.

public static int[] RotateBySpanCopy(int[] input, int distance)
{
    // validate
    if (input == null) throw new ArgumentNullException(nameof(input));
    if (distance < 0) throw new ArgumentOutOfRangeException(nameof(distance));
    if (input.Length == 0) return new int[0];

    // rotate
    var result = new int[input.Length];
    var target = new Span<int>(result);
    var diff = distance % input.Length;
    var source1 = new Span<int>(input, 0, input.Length - diff);
    source1.CopyTo(target.Slice(diff, input.Length - diff));
    var source2 = new Span<int>(input, input.Length - diff, diff);
    source2.CopyTo(target.Slice(0, diff));

    return result;
}

Span<T> lets us work with contiguous segments of memory and enables neat stuff like in-place casting without boxing and slicing without allocation.

This algorithm uses that exact slicing ability to copy the underlying memory from the two original segments to the two target segments.

Span<T> is aware of the size it is supposed to represent (that’s what the <T> is for) and therefore we do not need to include the type size in the slice calculations.

A More Efficient Solution? Unsafe.CopyBlock

The final contender in this over-engineering race is the new(ish) Unsafe.CopyBlock from the System.Runtime.CompilerServices (aka “shush, I know what I’m doing”) assembly.

public unsafe static int[] RotateByUnsafeCopy(int[] input, int distance)
{
    // validate
    if (input == null) throw new ArgumentNullException(nameof(input));
    if (distance < 0) throw new ArgumentOutOfRangeException(nameof(distance));
    if (input.Length == 0) return new int[0];

    // prepare to rotate
    var result = new int[input.Length];
    int size = Unsafe.SizeOf<int>();
    int diff = distance % input.Length;

    // pin memory locations
    fixed (int* target1 = result.AsSpan(diff, input.Length - diff))
    fixed (int* slice1 = input.AsSpan(0, input.Length - diff))
    fixed (int* target2 = result.AsSpan(0, diff))
    fixed (int* slice2 = input.AsSpan(input.Length - diff, diff))
    {
        // copy the underlying memory in the array
        Unsafe.CopyBlock(target1, slice1, (uint)((input.Length - diff) * size));
        Unsafe.CopyBlock(target2, slice2, (uint)(diff * size));
    }

    return result;
}

Unsafe.CopyBlock is as brute-force as it gets.

It will copy memory from one place to the other without regard to the runtime shufling memory as it goes. That’s why we need to pin any memory we need to touch, lest the runtime swipes it off our algorithm’s feet.

To use it, we need to mark the method as unsafe and build the assembly with /unsafe turned on, just to prove how desperate we are.

Performance

Here is how these algorithms stack against each other…



Method N Mean Error StdDev Ratio RatioSD Rank Gen 0/1k Op Gen 1/1k Op Gen 2/1k Op Allocated Memory/Op
RotateByRemainderIndexing 10 55.32 ns 1.2551 ns 1.6320 ns 1.00 0.00 ** 0.0203 - - 64 B
RotateByArrayCopy 10 54.73 ns 1.4390 ns 1.9698 ns 0.99 0.05 ** 0.0203 - - 64 B
RotateByBufferCopy 10 46.79 ns 0.9520 ns 0.8905 ns 0.84 0.03 *** 0.0203 - - 64 B
RotateBySpanCopy 10 44.12 ns 0.7237 ns 0.6416 ns 0.80 0.02 ** 0.0203 - - 64 B
RotateByUnsafeCopy 10 40.55 ns 0.5346 ns 0.5001 ns 0.73 0.02 * 0.0203 - - 64 B
                       
RotateByRemainderIndexing 100 490.21 ns 9.9104 ns 16.2831 ns 1.00 0.00 ** 0.1345 - - 424 B
RotateByArrayCopy 100 116.81 ns 2.5840 ns 3.0760 ns 0.24 0.01 *** 0.1347 - - 424 B
RotateByBufferCopy 100 97.67 ns 1.9820 ns 2.2824 ns 0.20 0.01 ** 0.1347 - - 424 B
RotateBySpanCopy 100 94.22 ns 1.9825 ns 2.2830 ns 0.19 0.01 * 0.1347 - - 424 B
RotateByUnsafeCopy 100 92.16 ns 1.9450 ns 2.5965 ns 0.19 0.01 * 0.1347 - - 424 B
                       
RotateByRemainderIndexing 1000 4,674.12 ns 81.3798 ns 72.1410 ns 1.00 0.00 ** 1.2741 - - 4024 B
RotateByArrayCopy 1000 613.15 ns 10.8852 ns 9.6494 ns 0.13 0.00 * 1.2779 - - 4024 B
RotateByBufferCopy 1000 604.01 ns 11.9966 ns 18.3201 ns 0.13 0.01 * 1.2779 - - 4024 B
RotateBySpanCopy 1000 625.79 ns 12.2431 ns 10.8532 ns 0.13 0.00 * 1.2779 - - 4024 B
RotateByUnsafeCopy 1000 599.51 ns 14.3827 ns 13.4535 ns 0.13 0.00 * 1.2779 - - 4024 B
                       
RotateByRemainderIndexing 10000 47,102.64 ns 1,061.3413 ns 1,993.4563 ns 1.00 0.00 ** 12.6343 - - 40024 B
RotateByArrayCopy 10000 6,383.94 ns 47.8780 ns 42.4426 ns 0.14 0.01 * 12.6572 - - 40024 B
RotateByBufferCopy 10000 6,470.84 ns 127.7785 ns 119.5241 ns 0.14 0.01 * 12.6572 - - 40024 B
RotateBySpanCopy 10000 6,677.21 ns 128.7819 ns 162.8678 ns 0.14 0.01 * 12.6572 - - 40024 B
RotateByUnsafeCopy 10000 6,547.83 ns 126.1428 ns 172.6656 ns 0.14 0.01 * 12.6572 - - 40024 B
                       
RotateByRemainderIndexing 100000 461,920.31 ns 6,744.3065 ns 6,308.6284 ns 1.00 0.00 *** 124.5117 124.5117 124.5117 400024 B
RotateByArrayCopy 100000 73,566.10 ns 1,464.5060 ns 1,686.5273 ns 0.16 0.00 ** 124.8779 124.8779 124.8779 400024 B
RotateByBufferCopy 100000 73,716.92 ns 1,468.5753 ns 2,286.3948 ns 0.16 0.01 ** 124.8779 124.8779 124.8779 400024 B
RotateBySpanCopy 100000 71,036.89 ns 802.9858 ns 626.9185 ns 0.15 0.00 * 124.8779 124.8779 124.8779 400024 B
RotateByUnsafeCopy 100000 73,232.43 ns 1,416.7108 ns 1,325.1922 ns 0.16 0.00 ** 124.8779 124.8779 124.8779 400024 B
                       
RotateByRemainderIndexing 1000000 5,715,133.83 ns 52,297.2313 ns 48,918.8626 ns 1.00 0.00 ** 273.4375 273.4375 273.4375 4000024 B
RotateByArrayCopy 1000000 3,471,175.86 ns 40,279.0739 ns 35,706.3499 ns 0.61 0.01 * 152.3438 152.3438 152.3438 4000024 B
RotateByBufferCopy 1000000 3,484,901.70 ns 75,821.6356 ns 77,863.2374 ns 0.61 0.01 * 152.3438 152.3438 152.3438 4000024 B
RotateBySpanCopy 1000000 3,395,618.79 ns 67,417.6361 ns 184,554.6909 ns 0.57 0.07 * 152.3438 152.3438 152.3438 4000024 B
RotateByUnsafeCopy 1000000 3,492,599.63 ns 52,870.8936 ns 46,868.6701 ns 0.61 0.01 * 152.3438 152.3438 152.3438 4000024 B
                       
RotateByRemainderIndexing 10000000 74,122,529.83 ns 1,659,361.8996 ns 2,037,845.4327 ns 1.00 0.00 ** 111.1111 111.1111 111.1111 40000024 B
RotateByArrayCopy 10000000 36,266,718.27 ns 386,807.3179 ns 361,819.8052 ns 0.49 0.01 *** 133.3333 133.3333 133.3333 40000024 B
RotateByBufferCopy 10000000 33,913,202.44 ns 349,838.7446 ns 327,239.3788 ns 0.46 0.01 * 125.0000 125.0000 125.0000 40000024 B
RotateBySpanCopy 10000000 35,197,233.69 ns 678,249.4610 ns 807,407.7360 ns 0.48 0.02 ** 133.3333 133.3333 133.3333 40000024 B
RotateByUnsafeCopy 10000000 36,101,336.10 ns 719,881.3796 ns 739,265.1758 ns 0.49 0.02 *** 133.3333 133.3333 133.3333 40000024 B
                       
RotateByRemainderIndexing 100000000 650,980,738.30 ns 7,266,152.6154 ns 6,796,763.6647 ns 1.00 0.00 *** - - - 400000024 B
RotateByArrayCopy 100000000 334,371,656.80 ns 6,517,933.2712 ns 11,243,114.6469 ns 0.52 0.01 ** - - - 400000024 B
RotateByBufferCopy 100000000 325,036,779.43 ns 7,624,095.8569 ns 8,157,697.1495 ns 0.50 0.01 * - - - 400000024 B
RotateBySpanCopy 100000000 318,656,163.20 ns 2,968,558.6617 ns 2,776,791.6140 ns 0.49 0.00 * - - - 400000024 B
RotateByUnsafeCopy 100000000 319,322,045.17 ns 2,789,657.9564 ns 2,609,447.7833 ns 0.49 0.00 * - - - 400000024 B

Here are some interesting bits from this benchmark:

  • Just about anything is better than remainder indexing.
  • The mean time of the memory copy methods is an order of magnitude lower than remainder indexing up 100k array size.
  • Said performance takes a hit at around 1M array size. The exact point may be specific to my own computer spec.
  • Copy mean time appears to stabilize at around half of remainder indexing. Again, the exact point may be specific to my own computer spec.
  • The garbage collector goes to sleep when we cross-over the 100M array size mark. That may be because we’re only allocating memory in the large object heap at this point, which goes straight into GC Generation 2.
  • None of the memory copy methods stand-out on the long run - they all behave the same.

In this case, it does not make sense to create a hybrid algorithm… But it does make sense to use one of the copy methods - which one wins, I’ll leave it up to you.

Takeaway

Sometimes one can benefit from over-engineering… Sometimes not… Sometimes both!

Neither of the fancy memory copy methods was significantly more performance than trusty old Array.Copy. Yet even Array.Copy by itself provided significant performance benefits over a naive remainder indexing approach. Why index what you can clone?

Resources

You can find all the code for this post in the Quicker repository, including all the benchmarks.

Jorge Candeias's Picture

About Jorge Candeias

Jorge helps organizations build high-performing solutions on the Microsoft tech stack.

London, United Kingdom https://jorgecandeias.github.io