Replicate/Migrate data from one database/table to another – Part 3

As we discussed in Part 1/Part 2 about the very simple approaches of migrating the changes from one data table to another, but depending upon the situations, data volume etc., there are a few improvements needed in those approaches.
In this article, we will talk about those limitations of the first approaches and will go through another one taking care of those limitations.

Limitations of part1/part2:

  1. Large data volume of source/target table(s): In case the table being merged are carrying loads of data, running merge joins on these table will be slower. And in case, there are not so many changes expected at a time, it would again be an overhead of running the checks on the complete data set.
  2. Large number of changing attributes: In case of tables having good number of columns (say 100 columns) and most of them are potential changing attributes i.e. any of those so many fields can change (from source to target), but we are not sure which one is changed while merging them. And therefore, running OR condition (as mentioned in Part1) to check if there is any change or not, may be so slow that it can result a severe DB server load.

Better approach in case of situations listed above….

  1. Dealing with large data tables:

If we are expecting not so many changes at a time (from the time merge process is run last), it‘s always good to identify the changed (new, updated) records using some timestamp column.

E.g. Assume if we have a ModifiedDate column into both source & destination table. Then at the beginning of the merge process, first identify the records modified, inserted newly or so; after the last run/merge.

(Note: Here we are assuming that source table is not truncated & reloaded full, otherwise we would need to perform the merge check on complete data-set)

  1. Get the last run date (modifiedDate) – this can be either be achieved by fetching the MAX(ModifiedDate) from the target table or when merge operation is being performed the current datetime can be stored in terms of some configuration table or so & then to be used while the next run.
  2. Get all the records from the source table – either into a temp table or so

WHERE Source.ModifiedDate > MAX_ModifiedDate

  1. Perform the merge operations either using MERGE statement or by means of SQL JOINS from the source’s temp table (with records > Max_ModifiedDate) to the target for the matching records (upon the joining keys).
  2. Update the target’s modifieddate and/or update the configuration date with the current date of the operation being performed at.

This will cut down the whole process to a limited numbers of records which are new or updated after the last run.

N.B. In case of the soft delete, UPDATE operation will take care of the same. But for the hard delete operation (where records are physically deleted), we would still need to do it on the complete table.

1. Find the changed records more efficiently: Now as mentioned in the previous approaches, when we tried to find out which records have actually changed so that update operation can be performed upon, we used OR statements e.g.

WHERE Source.Col1 <> Target.Col2

OR

Source.Col2 <> Target.Col2

……..

This would be a very slow affair if change is to be looked at very large number of attributes/columns.

There are a few other solutions to deal with it…as to find out the changed records in a more efficient way.

  1. By using INTERSECT
  2. By using EXCEPT

Usage of these both the clauses are more or less similar. We will take an example of INTERSECT here.

  1. First find out the matching records using INTERSECT

–Take the changed records from source table into a temp table – after the last run

SELECT * INTO #T_SourceEmp

FROM dbo.T_SourceEmployee

WHERE ModifiedDate > MAX_ModifiedDate –MAX(target.ModifiedDate) into a variable

–Get the intersected date

SELECT * INTO #CommonData

FROM

(

SELECT

EmployeeID

,EmployeeName

,DateOfJoining

,Designation

,ManagerName

FROM #T_SourceEmp

INTERSECT

SELECT

trg.EmployeeID

,trg.EmployeeName

,trg.DateOfJoining

,trg.Designation

,trg.ManagerName

FROM dbo.T_DestinationEmployee trg

INNER JOIN #T_SourceEmp src

ON trg.EmployeeID = src.EmployeeID

)

So, #CommonData contains the records which don’t have any difference from source to destination table (although their modifieddate was changed) and thus to be ignored.

And records which are updated on source but are not into #CommonData table are to be identified as these would be the actual changes.

  1. Find the updated records and perform the update operation:

SELECT src.* INTO #T_Updates

FROM dbo.T_SourceEmployee src

LEFT JOIN #CommonData CD ON src.EmployeeID = CD.EmployeeID

WHERE

ModifiedDate > MAX_ModifiedDate –MAX(target.ModifiedDate) into a variable

AND CD.EmployeeID IS NULL  –i.e. not existing as common/unchanged data

UPDATE trg

SET  trg.EmployeeName = upd.EmployeeName

,trg.DateOfJoining = upd.DateOfJoining

,trg.Designation = upd.Designation

,trg.ManagerName = upd.ManagerName

FROM dbo.T_DestinationEmployee trg

INNER JOIN #T_Updates upd

ON trg.EmployeeID = upd.EmployeeID

 

Advertisements
This entry was posted in SQL Server, SQL Server 2008 and tagged , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s