Problem:
Since populating the data warehouse or any other database may involve various disparate source-types like SQL Server, Excel spreadsheets, DB2 etc. It can also be a Flat File/CSV file source as well.
In case of data sources like SQL Server, Oracle wherein we generally get the data by querying the tables, views etc, we can actually put the logic to remove the duplicate probably by selecting the DISTINCT values or so. But in case of a Flat File data source we don’t have much control in terms of querying the data from it and still we want the duplicate records to be removed before loading them into the target database/file.
Solution:
Let’s take an example where we have a flat file with four columns (A,B,C,D) and some records (100,200,300,400 etc) as shown below:
Source File
And now say we want all of the columns to be loading into the target database, but with duplicate records (lines) removed.
Approach:
Steps:
1. Configure the Flat File Source as we required by selecting the file, delimiters (columns & header) etc.
2. Now drag ‘Aggregate’ transformation following Flat File Source and configure it to have ‘Group By’ on the deciding columns/fields for the record to be unique.
3. Configure destination by mapping the columns from Aggregate transformation to the target database/file.
Run the package, and you will see no duplicate record inserted into the target database/file.
Dest File




I think you wrote this article to solve my problem.
Good tip for real time scenario
Regards
Satheesh
Thanks for your comment, Satheesh !
-Munish Bansal
I created ssis package and job runs every three hours to update the data in ms access, But I need to update multiple tables but what what i notice after setting the job. All tables are not getting updated accordingly.Sql server job did run successfully but not updating all the tables. Can you please provide me some guidence what i am doing wrong.
Hi,
I want to remove duplicates in a package.I would like to know which transformation should i use.
hi sudha,
for avoiding duplicates we can use sort transformation but it is asynchronous transformation.
nice article, really helped a lot
This was really helpful. Solved my problem.
Hello there!
Sort and Aggregate transoformations are easy way to remove the duplicate records from the flat file as source. Using Aggregate transformation Munish has already given you a solution. Using Sort you can also do the same by checking the checkbox “Remove rows with duplicate sort values”. However, there are disadvantages to these transformations if your source data set is huge, because these are fully blocking transformations and creates additional buffers in the data flow engine pipeline. Alternatively what you can do is to stage the flat file data into a temporary table and use T-SQL to get only distinct values. But if your source data source is small then first two options are easy and quick.
Thanks,
Raj