As we discussed in Part 1/Part 2 about the very simple approaches of migrating the changes from one data table to another, but depending upon the situations, data volume etc., there are a few improvements needed in those approaches.
In this article, we will talk about those limitations of the first approaches and will go through another one taking care of those limitations.
Limitations of part1/part2:
- Large data volume of source/target table(s): In case the table being merged are carrying loads of data, running merge joins on these table will be slower. And in case, there are not so many changes expected at a time, it would again be an overhead of running the checks on the complete data set.
- Large number of changing attributes: In case of tables having good number of columns (say 100 columns) and most of them are potential changing attributes i.e. any of those so many fields can change (from source to target), but we are not sure which one is changed while merging them. And therefore, running OR condition (as mentioned in Part1) to check if there is any change or not, may be so slow that it can result a severe DB server load.
Better approach in case of situations listed above….
- Dealing with large data tables:
If we are expecting not so many changes at a time (from the time merge process is run last), it‘s always good to identify the changed (new, updated) records using some timestamp column.
E.g. Assume if we have a ModifiedDate column into both source & destination table. Then at the beginning of the merge process, first identify the records modified, inserted newly or so; after the last run/merge.
(Note: Here we are assuming that source table is not truncated & reloaded full, otherwise we would need to perform the merge check on complete data-set)
- Get the last run date (modifiedDate) – this can be either be achieved by fetching the MAX(ModifiedDate) from the target table or when merge operation is being performed the current datetime can be stored in terms of some configuration table or so & then to be used while the next run.
- Get all the records from the source table – either into a temp table or so
WHERE Source.ModifiedDate > MAX_ModifiedDate
- Perform the merge operations either using MERGE statement or by means of SQL JOINS from the source’s temp table (with records > Max_ModifiedDate) to the target for the matching records (upon the joining keys).
- Update the target’s modifieddate and/or update the configuration date with the current date of the operation being performed at.
This will cut down the whole process to a limited numbers of records which are new or updated after the last run.
N.B. In case of the soft delete, UPDATE operation will take care of the same. But for the hard delete operation (where records are physically deleted), we would still need to do it on the complete table.
1. Find the changed records more efficiently: Now as mentioned in the previous approaches, when we tried to find out which records have actually changed so that update operation can be performed upon, we used OR statements e.g.
WHERE Source.Col1 <> Target.Col2
Source.Col2 <> Target.Col2
This would be a very slow affair if change is to be looked at very large number of attributes/columns.
There are a few other solutions to deal with it…as to find out the changed records in a more efficient way.
- By using INTERSECT
- By using EXCEPT
Usage of these both the clauses are more or less similar. We will take an example of INTERSECT here.
- First find out the matching records using INTERSECT
–Take the changed records from source table into a temp table – after the last run
SELECT * INTO #T_SourceEmp
WHERE ModifiedDate > MAX_ModifiedDate –MAX(target.ModifiedDate) into a variable
–Get the intersected date
SELECT * INTO #CommonData
FROM dbo.T_DestinationEmployee trg
INNER JOIN #T_SourceEmp src
ON trg.EmployeeID = src.EmployeeID
So, #CommonData contains the records which don’t have any difference from source to destination table (although their modifieddate was changed) and thus to be ignored.
And records which are updated on source but are not into #CommonData table are to be identified as these would be the actual changes.
- Find the updated records and perform the update operation:
SELECT src.* INTO #T_Updates
FROM dbo.T_SourceEmployee src
LEFT JOIN #CommonData CD ON src.EmployeeID = CD.EmployeeID
ModifiedDate > MAX_ModifiedDate –MAX(target.ModifiedDate) into a variable
AND CD.EmployeeID IS NULL –i.e. not existing as common/unchanged data
SET trg.EmployeeName = upd.EmployeeName
,trg.DateOfJoining = upd.DateOfJoining
,trg.Designation = upd.Designation
,trg.ManagerName = upd.ManagerName
FROM dbo.T_DestinationEmployee trg
INNER JOIN #T_Updates upd
ON trg.EmployeeID = upd.EmployeeID