Thursday, April 7, 2016

Comparing two tables with millions of rows

The number of users of our SQL Server tools who deal with very large data sets is growing rapidly, and that growth is reflected in a significant increase in questions about comparing the data in two large databases, or even just two very large tables. So we decided to explain once more how xSQL Data Compare handles the comparison and what the user can do to speed up the process.

First the basics - here is what happens when you compare ServerS.DatabaseS.TableS with ServerT.DatabaseT.TableT (I am using S to denote the source server / database / table and T to denote the target server / database / table):

  1. xSQL Data Compare will connect to both servers and read the table definitions on each side
  2. It will then try to identify a common unique key (the primary key is the first choice) that will be used to pair the rows. At this point the user has the opportunity to change the comparison key if a different existing key would work better, or even to define a custom comparison key. 
  3. Once a comparison key has been selected it's time to compare the two tables:
    1. Read rows from TableS and TableT and bring them to the machine that is performing the comparison
    2. Pair the rows based on the comparison key
    3. Compare the values of each column, row by row
    4. Display the comparison results
  4. The next step is to generate the synchronization script that will make TableT identical to TableS
  5. The last step is to execute the synchronization script on the target server.
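For readers curious about what step 3 involves, the pairing logic can be sketched roughly as below. The function and parameter names are mine, not xSQL Data Compare's, and in reality the tool processes the data in chunks rather than holding everything in memory:

```python
# Illustrative sketch of key-based row pairing; not the tool's actual code.
# Rows are dicts keyed by column name; the comparison key is a list of columns.

def compare_tables(source_rows, target_rows, key_columns, compare_columns):
    """Pair rows on the comparison key and classify the differences."""
    def key_of(row):
        return tuple(row[c] for c in key_columns)

    source = {key_of(r): r for r in source_rows}
    target = {key_of(r): r for r in target_rows}

    # Rows whose key appears on only one side
    only_in_source = [source[k] for k in source.keys() - target.keys()]
    only_in_target = [target[k] for k in target.keys() - source.keys()]

    # Paired rows that differ in at least one compared column
    different = [
        (source[k], target[k])
        for k in source.keys() & target.keys()
        if any(source[k][c] != target[k][c] for c in compare_columns)
    ]
    return only_in_source, only_in_target, different
```

The classification into "only in source", "only in target", and "different" is exactly what drives the synchronization script in steps 4 and 5: inserts, deletes, and updates respectively.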
Now, steps 1 and 2 above are very quick and not taxing at all for any of the systems involved; the "trouble" starts at step 3. Suppose TableS and TableT each contain around 100 million rows and each is over 10GB in size. You can easily imagine how expensive this operation will be - performance depends on the servers hosting those databases, the machine running the comparison, and the speed of the connection between that machine and each of the servers. Here are the things that xSQL Data Compare does to make this kind of comparison possible (note that a lot of the tools on the market will not be able to do this at all):
  1. it carefully controls memory usage on the machine running the comparison by processing the data in chunks and serializing the data to disk
  2. it dynamically manages the comparison results grid, loading only portions of the data at a time, on demand. 
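To illustrate why chunked processing keeps memory bounded: if both sides can be read sorted by the comparison key, a single merge-style pass touches each row once and never needs to hold more than one row per side in memory. The sketch below illustrates that general idea; it is not xSQL Data Compare's actual implementation:

```python
# Merge-style comparison of two row streams already sorted by the
# comparison key; memory use stays constant regardless of table size.
# Illustrative only - not the tool's actual code.

_SENTINEL = object()  # marks an exhausted stream

def merge_compare(source_iter, target_iter, key):
    """Yield (kind, row_or_pair) tuples for rows streamed in key order."""
    src, tgt = iter(source_iter), iter(target_iter)
    s = next(src, _SENTINEL)
    t = next(tgt, _SENTINEL)
    while s is not _SENTINEL or t is not _SENTINEL:
        if t is _SENTINEL or (s is not _SENTINEL and key(s) < key(t)):
            yield ("only_in_source", s)
            s = next(src, _SENTINEL)
        elif s is _SENTINEL or key(t) < key(s):
            yield ("only_in_target", t)
            t = next(tgt, _SENTINEL)
        else:
            # Keys match: report the pair only if the rows differ
            if s != t:
                yield ("different", (s, t))
            s = next(src, _SENTINEL)
            t = next(tgt, _SENTINEL)
```

Because each side is consumed as a stream, the rows can come from chunked reads that spill to disk without changing the comparison logic.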
No matter how large the tables may be, xSQL Data Compare will succeed in comparing and synchronizing them as long as there is sufficient free disk space on the machine doing the comparison. Understandably, though, the operation may take a very long time. So what can the user do to speed it up? There are two simple but very important steps:
  1. exclude from the comparison any columns that do not need to be compared, especially blob-type columns. 
  2. use the "Where Clause" to exclude rows that do not need to be compared. For example, if your table contains a "date updated" timestamp and you know that you synchronized the tables a week ago, you can use the where clause to instruct the Data Compare tool to compare only rows where "DateUpdated" is later than the last time you synced the tables. This significantly reduces the amount of work, and the process may take only a few minutes instead of a few hours. 
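To make the effect of these two settings concrete, here is a sketch of how excluding columns and adding a where clause shrink the query each side has to run. The helper and the table and column names (`dbo.Orders`, `DateUpdated`) are hypothetical, chosen only for illustration:

```python
# Hypothetical illustration of how excluded columns and a where clause
# reduce the data read from each side; not part of xSQL Data Compare.

def build_select(table, all_columns, excluded=(), where=None):
    """Build the SELECT statement that reads one side of the comparison."""
    columns = [c for c in all_columns if c not in set(excluded)]
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if where:
        sql += f" WHERE {where}"
    return sql

# Skipping a blob column and comparing only rows changed since the last sync:
sql = build_select(
    "dbo.Orders",
    ["OrderID", "CustomerID", "Notes", "DateUpdated"],
    excluded=["Notes"],
    where="DateUpdated > '2016-03-31'",
)
```

Both filters apply on the server side, so less data crosses the network and less data has to be paired and compared.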
 In order to exclude certain columns and set a "where clause" you need to drill down to the table pair that is being compared (see the screenshot below).

