Project Description
A rare tool for rare situations, built only for theorical purposes, that may be useful to transfer a large amount of clustered data from a SQL Server instance to another.


It is usual to copy SQL tables from one instance to another. Usually we likely decide to use the Import/Export wizard, an Integration Services package or a method to transfer data as a text file. But problems may arise if we are talking about tables with more than 300 millions of records or 30gb and more and they must be copied in a minimum time.
If we are lucky this table may have a clustered index and then that's the scenario where this little project play a role.

Its logic is very simple: it scans clustered index statitics and define balanced clusters of data. It means that it creates as many queries as threads parametrized, and those queries filter on the clustered index field trying so as to obtain the same volume of data each one. Then it fires them using the sqlbulkcopy .Net method.

We must advice that depending on the amount of thread it is a very resource intensive process and in some cases CPU compsuntion could rise up to 100% and disk queue length may cause long waits.

Because of this, we recommend running on a different server other than the data source and destination.

Thus, with a structure of 2 networks with two 100mbit network cards in each server, we have achieved 190MB/seg transfer rates.


You should test it for statistical purposes before thinking on using it in a production environment.

Additional notes:
Performance gain will be achieved under certain circunstances, and sometimes it will not be percieved due to the nature of the data inside the clustered column. First of all it's necessary to clarify that clustered index saves statistics information only of the first column of a compound index, so it became more critical that the clustered index be well design. We mean that it'd better to follow the tip of ordering fields from the one with most granularity to the less one.
Another issue that may be encounter is the amount of Null data. Statistics talk about non null data, so if the table has an 80% of null data multithreading will not benefit from it, as all null data is copied in one thread as a unique batch.

This dll could be embebbed into other ETL projects that not depends or that are not built over the SSIS platform.

Last edited Apr 29, 2011 at 7:58 AM by DiegoPerdices, version 5