Tuesday, March 31, 2009

Generic collections of value types

When we ported my company's codebase from .NET 1.1 to 2.0 I was excited about being able to use generic collections. As I was looking for guidance on when and how to use them I remember reading the following in the MSDN documentation for List<T>:
Performance Considerations
In deciding whether to use the List<(Of <(T>)>) or ArrayList class, both of which have similar functionality, remember that the List<(Of <(T>)>) class performs better in most cases and is type safe. If a reference type is used for type T of the List<(Of <(T>)>) class, the behavior of the two classes is identical. However, if a value type is used for type T, you need to consider implementation and boxing issues.
If a value type is used for type T, the compiler generates an implementation of the List<(Of <(T>)>) class specifically for that value type. That means a list element of a List<(Of <(T>)>) object does not have to be boxed before the element can be used, and after about 500 list elements are created the memory saved not boxing list elements is greater than the memory used to generate the class implementation.
So based on that we tried to use List<T> for reference types (objects) but rarely for value types (ints, strings, longs, etc). Now the other day I was reading the excellent book C# in Depth by Jon Skeet (which I highly recommend) and came across the following:
Let’s start with a simple situation first, with a single type parameter—we’ll use List<T> for the sake of convenience. The JIT creates different code for each value type argument—int, long, Guid, and the like—that we use. However, it shares the native code generated for all the closed types that use a reference type as the type argument, such as string, Stream, and StringBuilder. It can do this because all references are the same size (they are 4 bytes on a 32-bit CLR and 8 bytes on a 64-bit CLR, but within any one CLR all references are the same size). An array of references will always be the same size whatever the references happen to be. The space required on the stack for a reference will always be the same. It can use the same register optimizations whatever type is being used—the List<Reason> goes on.
Each of the types still has its own static fields as described in section 3.4.1, but the code that is executed is reused. Of course, the JIT still does all of this lazily—it won’t generate the code for List<int> before it needs to, and it will cache that code for all future uses of List<int>
After reading that I posted a question to the book’s forum and confirmed with the author what should have made sense to me all along. Yes the JIT creates a new class for List<int> but it reuses that class for all other variables of that type. So while the MSDN article is technically correct, for any reasonable size program it’s a red herring. We incur the memory overhead once, not for every invocation of List<int>. For the common value types (int, long, DateTime, double) we should be using generic collections.
Again from the book:
However, the performance benefits of generics can be huge, and again that comes down to having the opportunity to JIT to different code for different types. Consider a List<byte>, for instance. In .NET 1.1, adding individual bytes to an ArrayList would have meant boxing each one of them, and storing a reference to each boxed value. Using List<byte> has no such impact—List<T> has a member of type T[] to replace the object[] within ArrayList, and that array is of the appropriate type, taking the appropriate space. So List<byte> has a straight byte[] within it used to store the elements of the array. (In many ways this makes a List<byte> behave like a MemoryStream.)
Figure 3.3 shows an ArrayList and a List<byte>, each with the same six values. (The arrays themselves have more than six elements, to allow for growth. Both List<T> and ArrayList have a buffer, and they create a larger buffer when they need to.)
The difference in efficiency here is incredible. Let’s look at the ArrayList first, considering a 32-bit CLR.8 Each of the boxed bytes will take up 8 bytes of object overhead, plus 4 bytes (1 byte, rounded up to a word boundary) for the data itself. On top of that, you’ve got all the references themselves, each of which takes up 4 bytes. So for each byte of useful data, we’re paying at least 16 bytes—and then there’s the extra unused space for references in the buffer.
Compare this with the List<byte>. Each byte in the list takes up a single byte within the elements array. There’s still “wasted” space in the buffer, waiting to be used potentially by new items—but at least we’re only wasting a single byte per unused element there.
We don’t just gain space, but execution speed too. We don’t need the time taken to allocate the box, the type checking involved in unboxing the bytes in order to get at them, or the garbage collection of the boxes when they’re no longer referenced.


Final Recommendation

  • Use Generic collections for collections of common value types (int/long/double/DateTime) wherever possible.
  • Evaluate use of generics on a case by case basis for less used value types (uint/byte/bool/etc).
  • Fix up existing collections as you write new code
Also here is a good guide to replacing a Hashtable with a Generic Dictionary. It has a table showing the equivalent members in the Dictionary class from the Hashtable couterpoints.

No comments: