This article provides a script that you can use to remove duplicate rows from a table in Microsoft SQL Server.
Original product version: SQL Server
Original KB number: 70956
Summary
There are two common methods that you can use to delete duplicate records from a SQL Server table. For the demonstration, start by creating a table and sample data:
    CREATE TABLE original_table (key_value int)
    INSERT INTO original_table VALUES (1)
    INSERT INTO original_table VALUES (1)
    INSERT INTO original_table VALUES (1)
    INSERT INTO original_table VALUES (2)
    INSERT INTO original_table VALUES (2)
    INSERT INTO original_table VALUES (2)
    INSERT INTO original_table VALUES (2)
Then, try the following methods to remove the duplicate rows from the table.
Method 1
Run the following script:
    SELECT DISTINCT *
    INTO duplicate_table
    FROM original_table
    GROUP BY key_value
    HAVING COUNT(key_value) > 1

    DELETE original_table
    WHERE key_value IN (SELECT key_value FROM duplicate_table)

    INSERT original_table
    SELECT *
    FROM duplicate_table

    DROP TABLE duplicate_table
This script performs the following actions in the given order:
- Moves one instance of any duplicate row from the original table to a duplicate table.
- Deletes all rows from the original table that are also in the duplicate table.
- Moves the rows from the duplicate table back to the original table.
- Drops the duplicate table.
This method is simple.
However, it requires that you have enough available space in the database to temporarily generate the duplicate table. This method also incurs overhead because it is moving the data.
Also, if the table has an IDENTITY column, you would have to use SET IDENTITY_INSERT original_table ON when restoring the data to the original table.
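If the table has an IDENTITY column, the restore step of method 1 might look like the following sketch. The table identity_demo and its id column are hypothetical and are not part of the original script. The sketch assumes that duplicates are defined by key_value alone and that the row with the lowest id is the one to keep.

```sql
-- Hypothetical table: id is an IDENTITY column, key_value defines a duplicate.
CREATE TABLE identity_demo (id int IDENTITY(1,1), key_value int);
INSERT INTO identity_demo (key_value) VALUES (1);
INSERT INTO identity_demo (key_value) VALUES (1);
INSERT INTO identity_demo (key_value) VALUES (2);
INSERT INTO identity_demo (key_value) VALUES (2);

-- Copy one row (the lowest id) per duplicated key_value to a work table.
SELECT MIN(id) AS id, key_value
INTO duplicate_table
FROM identity_demo
GROUP BY key_value
HAVING COUNT(*) > 1;

-- Remove every row whose key_value is duplicated.
DELETE identity_demo
WHERE key_value IN (SELECT key_value FROM duplicate_table);

-- Allow explicit values for the IDENTITY column while restoring the rows.
SET IDENTITY_INSERT identity_demo ON;

-- An explicit column list is required while IDENTITY_INSERT is ON.
INSERT INTO identity_demo (id, key_value)
SELECT id, key_value
FROM duplicate_table;

SET IDENTITY_INSERT identity_demo OFF;

DROP TABLE duplicate_table;
```

Turning IDENTITY_INSERT off again afterward lets later inserts go back to generating id values automatically.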
Method 2
The ROW_NUMBER function that was introduced in Microsoft SQL Server 2005 makes this operation much easier:
    DELETE T
    FROM
    (
        SELECT *
            , DupRank = ROW_NUMBER() OVER (
                PARTITION BY key_value
                ORDER BY (SELECT NULL)
            )
        FROM original_table
    ) AS T
    WHERE DupRank > 1
This script performs the following actions in the order listed:
- Uses the ROW_NUMBER function to number the rows within each partition of key_value, which can be one or more columns separated by commas.
- Deletes all records that received a DupRank value greater than 1. This value indicates that the records are duplicates.
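Before running the delete, it can help to preview which rows would be removed. The following sketch reuses the same derived table with a SELECT in place of the DELETE. It assumes the original_table from the demonstration.

```sql
-- Preview only: lists the rows that method 2 would delete, without changing data.
SELECT T.*
FROM (
    SELECT *,
           DupRank = ROW_NUMBER() OVER (
               PARTITION BY key_value
               ORDER BY (SELECT NULL)
           )
    FROM original_table
) AS T
WHERE T.DupRank > 1;
```

With the sample data (three rows with key_value 1 and four with key_value 2), this returns five rows. When the rows in a partition are not identical, which of them receive a DupRank greater than 1 is not guaranteed to be the same from one run to the next, because no sort order is specified.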
Because of the expression (SELECT NULL), the script does not sort the partitioned data based on any condition. If your logic for removing duplicates requires choosing which records to delete and which to keep based on the sort order of other columns, you can specify those columns in the ORDER BY expression.
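As an illustration of choosing which row to keep, the following sketch assumes a hypothetical modified_date column (not part of the demonstration table) and keeps the most recently modified row for each key_value.

```sql
-- Assumes original_table has a hypothetical modified_date column.
-- Within each key_value, the newest row gets DupRank = 1 and is kept.
DELETE T
FROM (
    SELECT *,
           DupRank = ROW_NUMBER() OVER (
               PARTITION BY key_value
               ORDER BY modified_date DESC
           )
    FROM original_table
) AS T
WHERE DupRank > 1;
```

If two rows share the same key_value and modified_date, the choice between them is still arbitrary. Adding another column to the ORDER BY expression can break such ties.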
Method 2 is simple and effective for these reasons:
- It does not require you to temporarily copy the duplicate records to another table.
- It does not require the original table to be joined with itself (for example, by using a subquery that returns all duplicate records by using a combination of GROUP BY and HAVING).
For best performance, you should have a corresponding index on the table that uses key_value as the index key and includes any sort columns that you used in the ORDER BY expression.
However, this method does not work on versions of SQL Server earlier than SQL Server 2005, because they do not support the ROW_NUMBER function. In this situation, you should use method 1 or a similar method instead.
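The index recommendation for method 2 can be sketched as follows. The index name and the modified_date sort column are hypothetical. Here the sort column is placed in the index key after key_value, which is one reasonable reading of the recommendation, so adjust it to the columns that your ORDER BY expression actually uses.

```sql
-- Hypothetical index: key_value leads, followed by the column used in ORDER BY.
CREATE INDEX IX_original_table_key_value
ON original_table (key_value, modified_date DESC);
```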