Batch or Partition a collection with LINQ

Since Generics were first introduced with .NET 2.0 I have had a utility method I used to batch a list into smaller lists of a fixed size, usually to batch up a bunch of database inserts into manageable groups of 10 or 25 or so.

Today I needed to do the same thing except I will be dealing with a potentially VERY large data set with some fairly complex computations built in. Forcing it all into a list and batching the list means I will have to hold all of that garbage in memory.

So I started looking for a LINQ implementation that would deal with only one batch at a time and keep the memory footprint low. I found very little – everything under a Google search for “LINQ batch” seemed to be about operations with LINQ-to-SQL. Don’t care.

I found Split a collection into n parts with LINQ? on Stack Overflow but was horrified by some of the algorithms there. They all seemed to commit at least one unforgivable sin:

  • Extensive use of division or modulus, although I can let this slide because the question was how to divide an unknown sized collection into X equal parts, as opposed to my desire to return an unknown number of identically-sized parts.
  • Accessed the Count() of the collection
  • SUPER long and/or complex
  • Created tons of intermediate objects.
  • Rendered the collection to a list or something that would cause massive computation

What is needed is simply take a few, and keep track of your place.

Here was my first solution. (DO NOT USE THIS! IT IS BAD!)

public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> collection, int batchSize)
	IEnumerable<T> remaining = collection;
		yield return remaining.Take(batchSize);
		remaining = remaining.Skip(batchSize);

Turns out this is horrible. Apparently you have to be careful when combining the different LINQ operators together. Performance testing showed that the first few batches are returned quickly, but each successive batch returned takes longer and longer to complete, resulting in an algorithm that takes exponentially longer as your collection size increases.

So instead, here is a solution that scales up well. I guess you could call it back to basics.

public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> collection, int batchSize)
	List<T> nextbatch = new List<T>(batchSize);
	foreach (T item in collection)
		if (nextbatch.Count == batchSize)
			yield return nextbatch;
			nextbatch = new List<T>(batchSize);
	if (nextbatch.Count > 0)
		yield return nextbatch;

Sometimes the simplest, shortest, most to-the-point solutions (without any fancy black boxes or tricks) are the best ones.

Related Posts:

  • ct

    Love it. Just what I need thxs!

  • wf

    This is great, exactly what I need. Thank you so much! :grin:

  • Helloworld

    big thumbs up man, just what was needed

  • Mike Brown

    You could use the first sample all you need to do is add a ToList() to remaining.Skip(batchSize)
    Without the ToList call remaining is still the same original Enumerable…you’re just chaining the Skip operators so that by the fifh iteration remaining == collection.Skip(batchSize).Skip(batchSize).Skip(batchSize).Skip(batchSize).Skip(batchSize).

    Your second sample imperatively does the ToList().

  • Guillaume Lecomte

    I did the same thing with LINQ:

    public static IEnumerable<IQueryable> Chunk(this IQueryable source, int chunkSize)
    var index = 0;
    var count = source.Count();
    while (index < count)
    yield return source.Skip(index).Take(chunkSize);
    index += chunkSize;

  • David Boike

    @Mike, adding ToList() will cause the entire enumeration to evaluate. If it contains millions of records, you’re going to have a performance problem – you may even run out of memory.

    @Guillaume, the same problem exists with using .Count() on the source – the enumeration will need to evaluate all the way to the end to get to the count, which will degrade performance as your enumeration size starts to increase.

  • Guest

    Very, very, very nice!

  • Afif

    Beautiful code. This made my day!

  • Digga

    Lovely, I was trying to write this myself as only came across yield last week and knew this was the way to go.

    Many thanks.

  • Eamon Nerbonne

    As a minor improvement you might want to use array’s as opposed to Lists. These are more efficient (and the code’s no more complex!). In general, I try to avoid List whenever I can – it’s almost never necessary when using LINQ. Usually, you want an immutable projection anyway; and list isn’t it. At least an array removes the most common form of mutabilty – adding items. It’s also faster; comes with better syntax for locals, and for helper functions has a clearer, shorter type signature. An array implementation would use something like arr[i++] = val; and for the last batch call Array.Resize.