Replicate/Migrate data from one database/table to another – Part 3

As we discussed in Part 1/Part 2 about the very simple approaches of migrating the changes from one data table to another, but depending upon the situations, data volume etc., there are a few improvements needed in those approaches.
In this article, we will talk about those limitations of the first approaches and will go through another one taking care of those limitations.

Limitations of part1/part2:

  1. Large data volume of source/target table(s): In case the table being merged are carrying loads of data, running merge joins on these table will be slower. And in case, there are not so many changes expected at a time, it would again be an overhead of running the checks on the complete data set.
  2. Large number of changing attributes: In case of tables having good number of columns (say 100 columns) and most of them are potential changing attributes i.e. any of those so many fields can change (from source to target), but we are not sure which one is changed while merging them. And therefore, running OR condition (as mentioned in Part1) to check if there is any change or not, may be so slow that it can result a severe DB server load.

Better approach in case of situations listed above….

  1. Dealing with large data tables:

If we are expecting not so many changes at a time (from the time merge process is run last), it‘s always good to identify the changed (new, updated) records using some timestamp column.

E.g. Assume if we have a ModifiedDate column into both source & destination table. Then at the beginning of the merge process, first identify the records modified, inserted newly or so; after the last run/merge.

(Note: Here we are assuming that source table is not truncated & reloaded full, otherwise we would need to perform the merge check on complete data-set)

  1. Get the last run date (modifiedDate) – this can be either be achieved by fetching the MAX(ModifiedDate) from the target table or when merge operation is being performed the current datetime can be stored in terms of some configuration table or so & then to be used while the next run.
  2. Get all the records from the source table – either into a temp table or so

WHERE Source.ModifiedDate > MAX_ModifiedDate

  1. Perform the merge operations either using MERGE statement or by means of SQL JOINS from the source’s temp table (with records > Max_ModifiedDate) to the target for the matching records (upon the joining keys).
  2. Update the target’s modifieddate and/or update the configuration date with the current date of the operation being performed at.

This will cut down the whole process to a limited numbers of records which are new or updated after the last run.

N.B. In case of the soft delete, UPDATE operation will take care of the same. But for the hard delete operation (where records are physically deleted), we would still need to do it on the complete table.

1. Find the changed records more efficiently: Now as mentioned in the previous approaches, when we tried to find out which records have actually changed so that update operation can be performed upon, we used OR statements e.g.

WHERE Source.Col1 <> Target.Col2

OR

Source.Col2 <> Target.Col2

……..

This would be a very slow affair if change is to be looked at very large number of attributes/columns.

There are a few other solutions to deal with it…as to find out the changed records in a more efficient way.

  1. By using INTERSECT
  2. By using EXCEPT

Usage of these both the clauses are more or less similar. We will take an example of INTERSECT here.

  1. First find out the matching records using INTERSECT

–Take the changed records from source table into a temp table – after the last run

SELECT * INTO #T_SourceEmp

FROM dbo.T_SourceEmployee

WHERE ModifiedDate > MAX_ModifiedDate –MAX(target.ModifiedDate) into a variable

–Get the intersected date

SELECT * INTO #CommonData

FROM

(

SELECT

EmployeeID

,EmployeeName

,DateOfJoining

,Designation

,ManagerName

FROM #T_SourceEmp

INTERSECT

SELECT

trg.EmployeeID

,trg.EmployeeName

,trg.DateOfJoining

,trg.Designation

,trg.ManagerName

FROM dbo.T_DestinationEmployee trg

INNER JOIN #T_SourceEmp src

ON trg.EmployeeID = src.EmployeeID

)

So, #CommonData contains the records which don’t have any difference from source to destination table (although their modifieddate was changed) and thus to be ignored.

And records which are updated on source but are not into #CommonData table are to be identified as these would be the actual changes.

  1. Find the updated records and perform the update operation:

SELECT src.* INTO #T_Updates

FROM dbo.T_SourceEmployee src

LEFT JOIN #CommonData CD ON src.EmployeeID = CD.EmployeeID

WHERE

ModifiedDate > MAX_ModifiedDate –MAX(target.ModifiedDate) into a variable

AND CD.EmployeeID IS NULL  –i.e. not existing as common/unchanged data

UPDATE trg

SET  trg.EmployeeName = upd.EmployeeName

,trg.DateOfJoining = upd.DateOfJoining

,trg.Designation = upd.Designation

,trg.ManagerName = upd.ManagerName

FROM dbo.T_DestinationEmployee trg

INNER JOIN #T_Updates upd

ON trg.EmployeeID = upd.EmployeeID

 

Posted in SQL Server, SQL Server 2008 | Tagged , | Leave a comment

Replicate/Migrate data from one database/table to another – Part 2

To serve the purpose of data synchronisation between two data table; as discussed in Part1, SQL Server 2008 was provided with a new feature of MERGE statement.

In this article we gonna go through the usage of MERGE statement/functionality and then briefly talk about the limitation/issues with that.

I think it’s better to take reference from the MSDN rather than replicating the same thing here again.

So here we go: http://msdn.microsoft.com/en-us/library/bb510625.aspx

A few main excerpts:

  1. Simple in use as UPDATE, INSERT & DELETE can be performed in one pass.
  2. At least one of the three MATCHED clauses must be specified, but they can be specified in any order. A variable cannot be updated more than once in the same MATCHED clause.
  3. The MERGE statement requires a semicolon (;) as a statement terminator. Error 10713 is raised when a MERGE statement is run without the terminator.
  4. When used after MERGE, @@ROWCOUNT (Transact-SQL) returns the total number of rows inserted, updated, and deleted to the client.

Limitations:

  1. For every insert, update, or delete action specified in the MERGE statement, SQL Server fires any corresponding AFTER triggers defined on the target table, but does not guarantee on which action to fire triggers first or last. This may result into some errors or data inconsistency if not handled properly.
  2. Although, source table can be sourced from a query/temp table etc. which can be manipulated as per whether the whole table or a part of the table to be MERGE operation to be performed upon, but the target table is to be taken as the complete table – and therefore it’s almost impossible to identify the actual INSERTES/DELETES; only update can be performed properly in this case.
  3. Not so performance at times and very less scope to tweak the statement to make it perform better.

Issues:

  1. Merge Conflicts / INSERT/UPDATE Race Condition: Let’s take an example where one record is intended to be inserted from source to target table since this particular records based upon the Primary Key/ Joining Key doesn’t exist at the target. The Merge statement under the default isolation level (Read Committed) will find it a potential Insert and will try to insert the same into the target table.

Now consider the scenario where another process tries to run the same merge procedure at the same time. This will also find the same record to be inserted.

But then when both of these processes (running in parallel) would try to insert the same record into the target table – Primary Key Violation error would be reported.

Actually, this shouldn’t be considered a bug. The atomicity of the MERGE transaction isn’t violated — it’s still an all-or-nothing behavior. So to prevent this conflict, a higher isolation level to be considered or table hint like WITH(HOLDLOCK) or                          WITH(SERIALIZABLE) to be used on the target table.

  1. Attempting to set a non-NULL-able column’s value to NULL: Sometimes while using the functions like ISNULL() or any window function into the ON clause of the MERGE statement, this error is seen. Even though we know that source doesn’t contain any NULL value for the non-nullable column it’s updating or inserting to, but still this error is reported at times.

This is a SQL Server bug in the MERGE statement. And there are many theories for the reason of this issue. One of them is that if query optimizer is having table spool in there, chances are high that it would result this error.

 

Posted in Uncategorized | Tagged , | 1 Comment

Replicate/Migrate data from one database/table to another – part 1

This is about a very simple though very useful and frequently used requirement to replicate/merge data from one database/table to another.

There can be various scenarios wherein we do get data from multiple sources into one database/table and then we might be performing some data cleansing, transformation & enrichment processes on these data and then eventually forming a final record-set. This may be happening into some staging area (or into a database not exposed to external world). This is a good practice for multiple reasons but a very basic advantage is to segregate the heavy operations from the data table being accessed by the external users/applications.

So, once all is prepared here locally we need to replicate/migrate it to the external database/table. This comprises of inserting new records, updating the existing records and deleting (hard or soft based upon the requirement) onto the target table.

Requirement: Let me take a very simple requirement to depict the solution around wherein we have a source/staging table which has got some new employees added, a few attributes of some of the existing employees are updated and a few employees have left. We need to merge this to target table.

(Data cleansing, transformation & enrichment etc processes are out of scope here and for the sake of simplicity, we are assuming that both tables are in the same database otherwise if in the different database, we can use the fully qualified name of the table like database.schema.tablename)

SourceTable: dbo.T_SourceEmployee

Target Table: dbo.T_DestinationEmployee

PrimayKey:     EmployeeID

We will talk about the solution/approach first then will try with actual SQL queries.

Approach: So the basic idea is to, 1. Reach out to the records in the source table which don’t exist into the target table, 2. Find out the records which are there in the target table but don’t exist into the source table anymore and 3. Figure out if anything has changed for the records which do exist into both source and the target table

Now, what should be the order of executing these above steps? The best approach is to

  1. Update
  2. Insert
  3. Delete

Reason for this order of execution is that, if we insert first then we would need to operate on for larger record-set for further operations unnecessarily. We can perform the delete operation at the beginning though i.e. delete, update & then insert.

Technical solution: Now, there can be several coding techniques to achieve this outcome. We will discuss one of them and in the next we will talk about how to deal with the shortcomings of this first approach.

This is the approach of making use of JOINS.

  1. UPDATE

UPDATE trg
SET Designation = src.Designation,
ManagerName = src.ManagerName
FROM dbo.T_DestinationEmployee trg
INNER JOIN dbo.T_SourceEmployee src
ON trg.EmployeeID = src.EmployeeID
WHERE src.Designation <> trg.Designation –values likely to be changing
OR src.ManagerName <> trg.ManagerName

2. INSERT

–New records
INSERT INTO dbo.T_DestinationEmployee
(
EmployeeID
,EmployeeName
,DateOfJoining
,Designation
,ManagerName
)
SELECT
src.EmployeeID
,src.EmployeeName
,src.DateOfJoining
,src.Designation
,src.ManagerName
FROM dbo.T_SourceEmployee src
LEFT JOIN dbo.T_DestinationEmployee trg
ON trg.EmployeeID = src.EmployeeID
WHERE trg.EmployeeID IS NULL


3. 
DELETE

–Delete no more valid records
DELETE trg
FROM dbo.T_DestinationEmployee trg
LEFT JOIN dbo.T_SourceEmployee src
ON trg.EmployeeID = src.EmployeeID
WHERE src.EmployeeID IS NULL

There are a few shortcomings to this above approach/solution, which we would be talking of in the forthcoming articles – .

—-Scripts to try with
CREATE TABLE dbo.T_SourceEmployee
(
EmployeeID INT PRIMARY KEY
,EmployeeName VARCHAR(255)
,DateOfJoining DATE
,Designation VARCHAR(255)
,ManagerName VARCHAR(255)
)
GO
CREATE TABLE dbo.T_DestinationEmployee
(
EmployeeID INT PRIMARY KEY
,EmployeeName VARCHAR(255)
,DateOfJoining DATE
,Designation VARCHAR(255)
,ManagerName VARCHAR(255)
)
GO
INSERT INTO dbo.T_SourceEmployee
VALUES(‘101′, ‘John’, ’01Jan2014′, ‘Associate’, ‘Paul’)
,(‘102′, ‘Richard’, ’01Feb2014′, ‘Manager’, ‘Simon’)
,(‘103′, ‘Ridham’, ’10Feb2014′, ‘Software Engineer’, ‘Shankar’)
GO
UPDATE dbo.T_SourceEmployee
SET Designation = ‘Sr. Associate’
WHERE EmployeeID = 101
GO

 

Posted in SQL Server, SQL Server 2008, Uncategorized | 2 Comments

Welcome back !

Hi All,

After being active on my blog for 2 years and then being away from it for good 3.5 years for certain reasons, I am back here !

Over these last 3.5 years, many of my colleagues & visitors suggested me to make a come back, but I think now the time has come to put some stuff which may make someone’s life easier….and before I completely forget all this :)

Hope to get the full support & response from you all !

Cheers !!

Posted in Uncategorized | Leave a comment

Dynamic Named Sets in SQL Server 2008 Analysis Services (SSAS 2008)

Introduction:

SSAS is having a nice & useful feature of defining the calculations (Calculated Members & Named Sets). These calculations can then be used in MDX or by any other client tools etc. These calculations come very handy to allow the user to use them to analyse their data with required parameters, attributes, & custom calculations. Using Named Sets, we can predefine the member(s) which are based upon certain conditions and then user can directly use them to see the measure/data against those members.

Let’s say we want to offer the user with a set of last 5 days/dates which contains transactional data, even though there may be lot of other later dates but with no fact data.

So for this we can create a regular Named Set on Calculation tab, something like:

 CREATE SET ActualSalePeriods AS

Tail

(      NonEmpty([Date Order].[Calendar].[Month].[Date].Members,

                        Measures.[Sales Amount]),

5);  

 Every calculated member/named set created here is session scoped & created with keyword ‘Create’ behind the scene.

 These Named Sets are evaluated and created every time the cube is processed so at the time of next process, if cube gets data with new dates, it will refresh the result set of this create Named Set as well accordingly. So we can say, they are static is nature and contents will remain intact until cube is reprocessed.

Dynamic Named Set:

These regular Named Sets are Static in Nature (only possible option till 2005 version) to improve the query performance by not automatically refreshing/querying it against cube every time. And in fact, works well in the situations like getting latest 5 dates as mentioned above since as in when new dates get added with data in the cube, it will be reprocessed & above set would also be updated with latest members then.

However, this may result a serious issue with respect to the actual requirement which (let’s say) is to get the latest Top 10 employees with highest sales on monthly basis. Created Named Set (static), evaluated at the time of processing, will give the correct result set with those employees. But now in case user wants to slice/dice the result set based upon some other dimension say region, product etc, again user will be offered with the same result set generated earlier at the time of processing, i.e. wrong data information.

 To cope up with such issues/requirements, SSAS 2008 has come up with one more option for creating the Named Set. It’s named as ‘Dynamic’. So developer can either select Static (default) to have the regular Named Set or can select Dynamic to give the Named Set a dynamic behaviour. And by selecting Dynamic option, SSAS will add a keyword ‘Dynamic’ before SET keyword to make it dynamic named set.

Dynamic sets are not calculated once. They are calculated before each & every query, and very important, in the context of that’s query WHERE clause and sub-selects. 

Thus the above requirement of getting Top 10 sales employees will work absolutely fine with Dynamic Named Set.

 Imp: Dynamic named set not only resolves the above mentioned issue but also adds capability to MDX which was not able to generate the desired results with regular named set. This is in regards with the issue while using the multi-select dimensions in the sub-select rather than where clause (which is case with excel wherein it always converts where to sub-select). So using Dynamic sets, multi-select statements are evaluated properly even being used in sub-select statements.

Posted in SQL Server 2008, SSAS 2008 | 6 Comments

How to remove duplicate records from Flat File Source (SSIS)

Problem:

Since populating the data warehouse or any other database may involve various disparate source-types like SQL Server, Excel spreadsheets, DB2 etc. It can also be a Flat File/CSV file source as well. 

In case of data sources like SQL Server, Oracle wherein we generally get the data by querying the tables, views etc, we can actually put the logic to remove the duplicate probably by selecting the DISTINCT values or so. But in case of a Flat File data source we don’t have much control in terms of querying the data from it and still we want the duplicate records to be removed before loading them into the target database/file.

 Solution:

Let’s take an example where we have a flat file with four columns (A,B,C,D) and some records (100,200,300,400 etc) as shown below:

 Source File

 

 

 

 

And now say we want all of the columns to be loading into the target database, but with duplicate records (lines) removed.

Approach:

 

 

 

 

 

 

 

 

 

Steps:

1. Configure the Flat File Source as we required by selecting the file, delimiters (columns & header) etc.

2. Now drag ‘Aggregate’ transformation following Flat File Source and configure it to have ‘Group By’ on the deciding columns/fields for the record to be unique.

3. Configure destination by mapping the columns from Aggregate transformation to the target database/file.

Run the package, and you will see no duplicate record inserted into the target database/file.

Dest File

Posted in SQL Server, SSIS | 10 Comments

How to draw a line based upon data values in Dundas Chart – Customizing Dundas Chart using Code Editor (SSRS)

Introduction:

Dundas has come up with the adding new abilities to Reporting Services (SSRS) in terms of more featured & customizable controls. Not only they look better with lot of UI options to customize, but also have powerful features which are very interesting and useful for analysis. The most popular among these is Dundas Chart control. Existing SSRS charts can also be into Dundas chart just by few right clicks.

Though there are lot of in-built options to select which are quite enough to fulfil the reporting requirements, it also lets developer to apply his/her imagination in designing the chart with the help of writing custom code. And this can be done using the Code Editor of the Dundas control. It provides some event handlers wherein .net (C#/Vb.net) code can be put as per the requirement e.g. PostPaint, PostApplyData, CustomizeChart, CustomizeLegend etc. These are called at some specific point in time & serve a specific purpose.

Drawing a line on Dundas chart for some specific values:

Here we will try to understand the use of Code Editor in Dundas chart with the help of a very simple example. There may be some business need to show a line or any shape based upon certain data conditions on the chart. You need to be a little bit familiar with the .net coding as well.

We will take a simple dataset with some dummy values. 

DataSet: DS_Chart
————————————————
SELECT     1000 AS Value, 1 AS Rate
UNION
SELECT     1400 AS Value, 1.5 AS Rate
UNION
SELECT     2100 AS Value, 2 AS Rate
UNION
SELECT     1450 AS Value, 2.75 AS Rate
UNION
SELECT     2300 AS Value, 3 AS Rate
UNION
SELECT     3400 AS Value, 3.5 AS Rate
UNION
SELECT     2340 AS Value, 4 AS Rate
UNION
SELECT     4440 AS Value, 4.5 AS Rate
UNION
SELECT     4200 AS Value, 5.5 AS Rate
UNION
SELECT     3400 AS Value, 6 AS Rate
UNION
SELECT     9600 AS Value, 6.5 AS Rate
UNION
SELECT     5300 AS Value, 7 AS Rate
UNION
SELECT     5600 AS Value, 9 AS Rate
 Order By Rate
————————————————

The idea is to draw a line chart wherein value (Y-axis) will change overt the rate (X-axis). First we will configure the Dundas Chart control for a basic chart requirement.

  1. Right click the control & select properties.
  2. Give a title to the chart & go to Data tab.
  3. Select ‘DS_Chart’ as dataset.
  4. Add a Value series and there select
    1. rate on X-axis
    2. sum(value) on Y-axis.

  1. 5. Add one CategoryGroup with Expression as Rate & label as also Rate.

6. Click Ok. Go to Preview tab to see the report. It will look a like as:

Now adding the custom code where 3 < Rate > 4 & Value > 3000.

 7. Go to properties and click on Advanced tab.

  1. 8. Click on ViewCode button there.
  2. 9. Select PostPaint event & put the below code inside the code editor.

// Chart Graphics object

ChartGraphics graphics = e.ChartGraphics;              

 

// Check what chart element fired the event

if(sender is Series ){

      Series series = (Series) sender;

               

      //Can also check for the series name using series.Name property

            double xValue;

        double yValue;

       

        // Data points loop

      foreach( DataPoint point in series.Points ){

     

        // Get relative position of the point, which will be the center of the ellipse.

      xValue = graphics.GetPositionFromAxis( “Default”, AxisName.X, point.XValue);

      yValue = graphics.GetPositionFromAxis( “Default”, AxisName.Y, 0);

      // Get absolute point.

      PointF absPoint = graphics.GetAbsolutePoint( new PointF( (float)xValue, (float)yValue ) );

       //Draw a line for data

       if(point.YValues[0] > 3000 && (point.XValue > 3 && point.XValue < 4))

       {

            Pen blackPen = new Pen(Color.Black, 3);

            PointF endPoint = graphics.GetAbsolutePoint( new PointF((float)xValue, (float)40) );

            graphics.DrawLine(blackPen, absPoint,endPoint);

       }

    

      }   

 }

 

10. Click Ok.

11. Now go to Preview tab to see the report. It will look similar to the following:

As shown, a line has been drawn at the specified condition’s location.

Posted in SQL Server, SSRS | 4 Comments