Monday, April 4, 2011

Internals of LINQ

If you remember my post on Annonymous Types, I have already stated that there is no concept of annonymous types in MSIL. Every generated type will be mapped to a concrete type produced by C# compiler during the runtime. In this post, I am going to cover few basics of LINQ, and how it is been maintained in IL. If you are very new to Linq, it is recommended to learn the basic usage of it first. You can read my article on Linq Basics, to know more about it.

LINQ means Language Integrated Query is one of the major step forward to .NET framework to support queries to work on objects. LINQ allows you to write custom query statements on .NET objects or more specifically any IEnumerables to filter or fetch data from it.

In this post, I am not covering the basics as I have already discussed it in a separate article. Lets start with internals.



Simple Projection

To start with Internals, let me create a List of strings on a Type. You should already know, that you can easily replace .NET initializer to allow initialize array into any IEnumerable.

public static List<string> myList = new List<string> 
{ 
"Abhishek", 
"Amit", 
"Abhijit", 
"Kunal", 
"Krishna", 
"Dhananjay" 
};

Yes it will compile, and by your previous knowledge, you think it is initialized the same way if you use an array. But it isnt.

C# compiler automatically replaces the code with proper List.Add statements to add each individual statements. If you see the same code in reflector, you will see a new backup list variable is added to eventually add each strings internally.

var simpleprojection = from i in Program.myList
                                   select i;

Now starting what we call the most basic LINQ query represents to project each individual elements in the list directly into another IEnumerable. Now how does the IL looks like :



Now here you wonder how both the sections as I have discussed looks like. Yes, the initializer introduces a Static constructor internally (created by compiler itself) which adds up each of the strings into the List.  On the other hand, simpleprojection gets an Enumerable from Enumerable.Select which takes an IEnumerable and a delegate.

Internally the Enumerable puts each element from an enumerator and process it using the delegate we pass. Just as our delegate actually returns i. So nothing will be changed. Now let me change the code slightly.

var simpleprojection = from i in Program.myList
                       select new { x = i.StartsWith("A"), y = i.Reverse() };

Hmm, the code looks similar, but here we have yield a new object. Now if you see the IL, it produces a concrete implementation of the class to hold X and Y and eventually creates object of each.

Click to see the Image
Well, yes,  as you can see, the IEnumerable returns an object of Concrete Type. Know more about anonymous types from "Internals of Anonymous Type".

Now coming to our scenarios, the code is same as the other one, just replacing the select statement inside the delegate.

Lets add up a bit more with Let :

var simpleprojection = from i in Program.myList
                        let isStartWithA = i.StartsWith("A")
                        let reverseString = i.Reverse()
                        select new { x = isStartWithA , y = reverseString };

Well, the code works the same with two let statements which assigns the value of each on separate variable isStartWithA and reverseString. But internally everything is now changed.

Hmm, the statement is now replaced with three Enumeration.Select calls nested one within another. Lets break these lines manually to understand what is happening.

var  firstEnumerable = Enumerable.Select(myList, delegate (string i) {
                                                    return new { i = i, isStartWithA = i.StartsWith("A") };
          });
          //Replacing <>f__AnonymousType0<string, bool> fType<string, bool>
          var secondEnumerable = Enumerable.Select(
                                            firstEnumerable, 
                                            delegate (fType<string, bool> obj) {
                                                return new { Fobj = obj, 
                                                                reverseString = obj.i.Reverse<char>() };
                                            });

        //Replacing <>f__AnonymousType1<<>f__AnonymousType0<string, bool> to fType2<fType<string,bool>, IEnumerable<char>>
         var simpleprojection = Enumerable.Select(
                                   secondEnumerable, 
                                    delegate (fType2<fType<string,bool>, IEnumerable<char>> obj) {
                                    return new { x = obj.Fobj.isStartWithA, 
                                                y = obj.reverseString };
                                });

I have intentionally replaced few compiler generated typenames to some other to make you understand code better.

In the first call we pass the list to a delegate which creates a object with variable isStartWith which is the boolean representation of our first let statement.

In the second call, we pass the result of first call. The second delegate generates the reverseString wrapping around the first call object.

Finally, in the outermost call, it will generate the end product.

Now why should you need three nested calls ? Yes there is a reason. Actually when you define the first let statement, you would internally think that from the next statement we should make the variable x available. So the scope of the First let, isStartWith lies on both of the next two lines reverseString and select. Hence we pass on the value in a anonymous type. Thus we can easily use isStartWith on both of these lines.

Similarly on the next successive calls we need to wrap around all the let statements internally into its respective anonymous types until finally we reach out the select. The compiler is smart enough to go on create anonymous types with first property being the parent object and second being the actual let statement.

Hence you can infer, for every let statement in LINQ query, compiler will generate 1 anonymous type at least.

Now lets alter our query to introduce Order By.

var simpleprojection = from i in Program.myList
                        orderby i
                        select i;

If we look into the Reflector for the code we see the code writes the same thing as the first code, the only difference is it is using Enumerable.OrderBy instead of Select and returns back IOrderedEnumerable. The OrderBy actually processes each element using the delegate that we pass, and puts it into a call to OrderedEnumerable, which internally uses EnumerableSorter to get each object from the List.

If you ask me does it internally stores the elements in sorted order after it is being processed, then I think you need to read more about IEnumerable may be from C# iterators. Actually when you enumerate from IOrderedEnumerable, it will internally get the next element from the list which represents the ordering. As IEnumerable internally maintains a state machine, it will save its state until further yield.

Next, lets put a GroupBy.

var simpleGroup = from i in Program.myList
                        group i by i.Count() into g
                        select new { Count = g.Count(), Elements = g };

Well, here we grouped the list based on character count. You should note each Group statement gives you an object (g) which represents the group and you can use g in your select statement.

Now lets see it into reflector:

So you can see here in to code, the C# compiler wraps around a Enumerable.GroupBy inside the Enumerable.Select where the former returns an IEnumerable of IGrouping members. The select statement then uses this IGrouping object to generate its result.

Similar to OrderBy the GroupBy uses GroupedEnumerable object to generate each Group object on each yield. You should note, the object g is internally managed IGrouping object, hence groups does not generate additional Types.

Finally lets see how Join statement looks like. For simplicity I used inner join statement.

var customers = new List<Customer>{ new Customer { Id = 1, Name="Abhishek"},
                                new Customer { Id = 2, Name = "Amit"},
                                new Customer { Id = 3, Name = "Abhijit"},
                                new Customer { Id= 4, Name="Kunal"},
                                new Customer { Id = 5, Name= "Krishna"},
                                new Customer { Id = 6, Name = "Dhananjay"}
};

var orders = new List<Orders>
{
    new Orders { Id = 1, CustomerId = 4, NosOrder=20},
    new Orders { Id = 2, CustomerId = 1, NosOrder=30},
    new Orders { Id = 3, CustomerId = 2, NosOrder=50},
    new Orders { Id = 4, CustomerId = 1, NosOrder=10},
    new Orders { Id = 5, CustomerId = 3, NosOrder=60},
    new Orders { Id = 6, CustomerId = 4, NosOrder=10},
};

var simpleJoin = from c in customers
                join o in orders on c.Id equals o.CustomerId
                select new { c.Name, o.NosOrder };

Well, here I have put one of the most simplest customer and order Join which puts only the equal elements on the result. If you see this on reflector



Hmm, it eventually calls upon the Enumerable.Join statement with delegates which filters the first argument, the  second argument and the result. In case of Joins, for each yield, the first and second delegate is fetched once, and 3rd delegate is called upon for each successful match in the lists. The framework uses JoinIterator internally to iterate between two lists.

For OuterJoins :
var simpleJoin = from c in customers
                join o in orders on c.Id equals o.CustomerId into outer
                from l in outer.DefaultIfEmpty()
                select new { c.Name, l.NosOrder };

It introduces Enumerable.GroupJoin nested inside Enumerable.SelectMany.


The GroupJoin uses GroupJoinIterator to group each individual first delegate with second delegate and each group is selected.

Download Sample - 36KB

Conclusion

I agree, there are lot more to talk about LINQ, but for internals, these puts the basic foundation. It is recommended to go through Reflector a bit more to discover more of the actuals.

I hope the article comes helpful to all those who wants to learn about internals of C#. Try my whole Internal Series.

Thanks for reading.

No comments:

Post a Comment

Please make sure that the question you ask is somehow related to the post you choose. Otherwise you post your general question in Forum section.

Author's new book

Abhishek authored one of the best selling book of .NET. It covers ASP.NET, WPF, Windows 8, Threading, Memory Management, Internals, Visual Studio, HTML5, JQuery and many more...
Grab it now !!!