Friday, December 31, 2010

ADO.NET : Some internals uncovered Part 2

For the last 2 days, I thought of unleashing few of the hidden facts of Data Storage. It is true, that introduction of .NET framework without introducing Generic in first place is one of the biggest mistakes. Because of Non-Generic data structure a large amount of data is been unnecessarily boxed and unboxed without any reason as such. After Generics was introduced  with .NET 2.0, most of these classes which made you do type conversions is either been totally depreciated or few of them rarely used as is there for backward compatibility. But for every programmer, the most common class called DataTable still exists without any change. In this post we will try to see the actual implementation of DataTable as a whole, so that we could understand what is going on when we store data into it.

Introduction

DataTable is an application object that maps to a Database Table. It holds a collection of DataRows and DataColumns where DataRow represents the actual Data while the DataColumns holds Integration rules. Basically we all know the basics of DataTable and often use it for our regular work.

DataSet holds a list of DataTable, and DataRelation. DataTable as being said is the storage of Data in form of a Table with Row and Columns, the DataRelation holds the Relationship rules between two tables in a DataSet. Hence ADO.NET has been structured in such a way that you can have the flavour of entire database from your application end.

But, one thing that you must keep in mind, these objects are created way back in time of .NET 1.0 release or even before when there is no Generics or even the concept of State Machines. Generics allows you to pass type of an object during actual object creation and hence will allow you to create Type information based on the object which we create in real time, as a result it eliminates unnecessary runtime boxing and unboxing. Keeping this in mind, I thought I should check how these classes are made in Framework. Lets look each of them, one by one in this post and lets identify pros and cons.



DataSet

DataSet is the application object that can hold the entire structure of the database schema. Basically, internally it is a serializable object which holds two set of collection. DataTableCollection and DataRelationCollection where the former is an implementation of an ArrayList of DataTables and the later is a list of DataRelation. DataSet can be Cloned (Produces another object with New tables from existing schema) or Copied (Produces another object with same tables)

Some of the important benefits of DataSet 
  1. Easily Serializable, even could be written to XML File using WriteXML or can be read to the Object.
  2. Allows you to Merge other DataSet into it easily.
  3. Can hold Relation and Tables into it.
Problems & best implementation of DataSet
  1. Can only hold a collection of DataTable, and it does not allow us to define our own data structures and produce custom DataRelation among them.
  2. No Generic support, use of ArrayList means each element has to be upcast to object when being stored and again to actual element (DataTable) when being retrieved. 

DataTable

DataTable is the primary object which holds the Data. It holds a collection of DataColumn and DataRow. Each of the DataRow holds the data element while DataColumn holds the rules between each columns. The DataRowCollection internally represents an ArrayList of DataRow while DataColumnCollection is on DataColumns.

To hold the DataRows, a DataTable maintains an internal sealed class called RecordManager. Everytime you create a DataRow element inside a DataTable class, the object is stored into RecordManager. The DataRowCollection holds a list of DataRow. Here RBTree class implemented as Generic list of DataRow objects is used.

Some important benefits of DataTable

  1. DataTable can store any Table even if the data is complex. 
  2. Serializable, so can be ported in communication medium easily.
  3. Keeps track of indexed field.
Problems of DataTable

  1. Most of the objects are non-generic, and hence leading to conversion of types.
  2. Very bulky.

DataRow


DataRow on the other hand holds the actual data. It has an indexer to the object which lets you get or set data to it.

Internals to ADO.NET Continued....

Now let us take a look at how the data being stored into these data structures and later retrieved from it. Say you have opened a connection to database and used Adapter.Fill to fill data into a DataTable. Now lets examine what internally happens to store Data into the DataTable and also retrieving the same when required. Say the code I write to retrieve the Data is as below :

SqlDataAdapter adpt = new SqlDataAdapter("Select * from Table", conn);
adpt.Fill(DTable);


Now when you call Fill method, it actually invokes the external SQL server process to load the Data into memory and parses the loaded memory into the existing data structure. If you have already read my last post on internals of ADO.NET,  you should be clear how the DataRows are created using DataReader and put into DataRowCollection. In this post I will cover from there, how the interal DataRow is parsed.

The DataReader loads each DataRow into memory and takes you through with calling LoadDataRow method of SchemaMapping. If you see into Reflector the code for LoadDataRow looks like :




Here the initial call to dataReader.GetValues is actually getting the data from memory into DataRow, which is then Loaded into DataTable and finally to the DataSet. Now if you try to look into the defination of GetValues you need to look into the actual implementation of it in the provider specific classes. For SqlClient, the data is loaded using SqlBuffer class. The Buffer reads each individual entries and puts it as an array. So basically for a DataRow object loading Data for Sql Server, it puts the actual Data into an array of SqlBuffer. So if you can see the implementation of SqlBuffer, you would be clear about the actual data storage of DataRow.


So if you look into the implementation of SqlDataReader, you can see it basically maintains the collection of SqlBuffer. SqlBuffer is the actual unit of data storage. It maintains a structure Storage in it which has member to store almost all data basic data types The Storage stores the actual data in Type  and the SqlBuffer holds the exact type information.


Here inside the SqlBuffer you can see the member _value as Storage. StorageType on the other hand is an Enum which sets the actual type of Data associated in the DataRow.  The SqlBuffer also exposes properties, each of which can get or set the value for the object. The exact type information is also held into DataColumnCollection and according to that the Type of the Data is parsed.

Hence, this means the DataRow is actually stored internally into proper types and hence does not have any performance implication, right? No, there is.

Actually the performance implication lies above these fact when you try to retrieve the value or store it into the DataRow object. We generally use indexer to fetch data from the DataRow. The indexer to the DataRow is  actually of type object.

public object this[]
{ 
     get; set;
}

This is the major problem to DataRow class. So eventually, every time you try to retrieve data from DataRow, even though proper type is maintained underneath, the indexer box to object type. Also, upon using setter, the proper type maintained from our code is also being boxed to the setter and which again unboxed to proper data type.

Conclusion

I think these classes could be generalized more to induce better performance to the API. I am also hoping to have an enumeration of generic objects in the whole data structure of DataRow and DataColumns (may be using DataCell<t>) in future.

I hope you like the post.

I don't know if I did made you things clear? I have to put only those which are important to know.  I would suggest to use Reflector to see the implementation in detail to know further.

Thank you for reading. Looking for your feedback.

No comments:

Post a Comment

Please make sure that the question you ask is somehow related to the post you choose. Otherwise you post your general question in Forum section.

Author's new book

Abhishek authored one of the best selling book of .NET. It covers ASP.NET, WPF, Windows 8, Threading, Memory Management, Internals, Visual Studio, HTML5, JQuery and many more...
Grab it now !!!