How to remove duplicate records from Flat File Source (SSIS)

Problem:

Populating a data warehouse, or any other database, may involve various disparate source types such as SQL Server, Excel spreadsheets, DB2, etc. The source can also be a flat file/CSV file.

With data sources such as SQL Server or Oracle, where we generally get the data by querying tables, views, etc., we can put the logic to remove duplicates in the query itself, for example by selecting DISTINCT values. With a flat file data source, however, we don’t have much control in terms of querying the data, yet we still want the duplicate records removed before loading them into the target database/file.

Solution:

Let’s take an example where we have a flat file with four columns (A, B, C, D) and some records (100, 200, 300, 400, etc.) as shown below:

Source File (screenshot: the flat file with columns A, B, C, D, including duplicate records)
And now say we want all of the columns to be loaded into the target database, but with duplicate records (lines) removed.

Approach:

(Screenshot: data flow with Flat File Source → Aggregate transformation → destination)
Steps:

1. Configure the Flat File Source as required by selecting the file, delimiters (column and header), etc.

2. Drag an ‘Aggregate’ transformation after the Flat File Source and configure it with ‘Group By’ on the columns that decide whether a record is unique.

3. Configure the destination by mapping the columns from the Aggregate transformation to the target database/file.

Run the package, and you will see that no duplicate records are inserted into the target database/file.
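Outside SSIS, the effect of the Aggregate transformation with Group By on every column can be sketched in a few lines of Python. This is an illustrative sketch only (the function name and sample values are mine, echoing the example records above), not part of the package itself:

```python
def dedupe_rows(rows):
    """Keep one copy of each distinct row, like an SSIS Aggregate
    transformation configured with Group By on every column."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Four columns (A, B, C, D); the third record duplicates the first.
rows = [
    ("100", "200", "300", "400"),
    ("101", "201", "301", "401"),
    ("100", "200", "300", "400"),
]
print(dedupe_rows(rows))  # the duplicate record appears only once
```

Grouping by every column collapses identical lines into one, which is exactly why the Aggregate transformation removes whole-record duplicates.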

Dest File (screenshot: the destination with duplicate records removed)

This entry was posted in SQL Server, SSIS.

10 Responses to How to remove duplicate records from Flat File Source (SSIS)

  1. Satheesh says:

    I think you wrote this article to solve my problem.

    Good tip for a real-time scenario.

    Regards
    Satheesh

  2. Thanks for your comment, Satheesh!

    -Munish Bansal

  3. Diva says:

    I created an SSIS package, and the job runs every three hours to update the data in MS Access. I need to update multiple tables, but what I noticed after setting up the job is that not all the tables are getting updated accordingly. The SQL Server job ran successfully but did not update all the tables. Can you please give me some guidance on what I am doing wrong?

  4. sudha says:

    Hi,
    I want to remove duplicates in a package. I would like to know which transformation I should use.

  5. nageswararao says:

    hi sudha,
    To avoid duplicates we can use the Sort transformation, but note that it is an asynchronous (blocking) transformation.

  6. nice article, really helped a lot

  7. Pranav Kapadia says:

    This was really helpful. Solved my problem.

  8. Raj says:

    Hello there!
    Sort and Aggregate transformations are an easy way to remove duplicate records from a flat file source. Munish has already given you a solution using the Aggregate transformation. With Sort you can do the same by checking the checkbox “Remove rows with duplicate sort values”. However, these transformations have disadvantages if your source data set is huge, because they are fully blocking transformations and create additional buffers in the data flow engine pipeline. Alternatively, you can stage the flat file data into a temporary table and use T-SQL to get only distinct values. But if your source data set is small, the first two options are easy and quick.

    Thanks,
    Raj
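    The stage-then-SELECT-DISTINCT pattern Raj outlines can be sketched as follows. This is an illustrative sketch that uses SQLite in place of SQL Server (table and column names are hypothetical); in SQL Server the distinct query would be T-SQL against a staging/temp table:

```python
import sqlite3

# Stage the flat-file rows into a table, then select only distinct
# rows -- the T-SQL equivalent would be something like:
#   SELECT DISTINCT A, B, C, D FROM #Staging
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (A INT, B INT, C INT, D INT)")
rows = [(100, 200, 300, 400), (101, 201, 301, 401), (100, 200, 300, 400)]
conn.executemany("INSERT INTO staging VALUES (?, ?, ?, ?)", rows)
distinct = conn.execute(
    "SELECT DISTINCT A, B, C, D FROM staging ORDER BY A"
).fetchall()
print(distinct)  # the duplicate (100, 200, 300, 400) appears only once
```

    Because the deduplication happens in the database engine rather than in the SSIS pipeline, this avoids the blocking-transformation cost Raj mentions for large data sets.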

  9. sudhakar says:

    Raj, can you give a brief explanation of staging the data in a temp table and how to remove duplicate records from a flat file?

  10. Pingback: Removing Duplicate Rows With SSIS | Sarah.C
