Duplicate records waste time, space, and money. Learn how to find and correct duplicate values using the SQL GROUP BY and HAVING clauses.
The best way to practice the GROUP BY and HAVING clauses is LearnSQL.com’s interactive SQL Practice Set course. It contains more than 80 hands-on exercises that let you practice different SQL constructs, from simple single-table queries, through aggregation, grouping, and joining data from multiple tables, to complex topics such as subqueries.
Database best practices typically dictate having unique constraints (such as the primary key) on a table to avoid row duplication when extracting and consolidating data. However, you may find yourself working on a dataset with duplicate rows. This could be due to human error, application error, or uncleaned data that has been extracted and merged from external sources, among other things.
Why correct duplicate values? They can distort calculations. They can even cost a company money: for example, an e-commerce business may process duplicate customer orders multiple times, which can have a direct impact on the company’s bottom line.
In this article, we will discuss how you can find those duplicates in SQL using the GROUP BY and HAVING clauses.
How to Find Duplicate Values in SQL
First, you’ll need to define criteria for detecting duplicate rows. Is it a combination of two or more columns where you want to detect duplicate values, or are you just looking for duplicates within a single column?
In the examples below, we’ll explore both scenarios using a simple customer order database.
In terms of the general approach to either scenario, finding duplicate values in SQL comprises two key steps:
- Use the GROUP BY clause to group all rows by the target column(s), that is, the column(s) in which you want to check for duplicate values.
- Use the COUNT function in the HAVING clause to check whether any group has more than one entry; those groups are the duplicate values.
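The two steps above can be sketched as a small reusable helper. This is only an illustration, written with Python’s built-in sqlite3 module so it runs on its own; the helper name find_duplicates, the Subscribers table, and its rows are hypothetical and do not come from the database used in this article.

```python
import sqlite3


def find_duplicates(conn, table, columns):
    """Return one row per duplicated combination of `columns`, plus its count.

    Table and column names are interpolated directly into the SQL text,
    so pass only trusted identifiers (this is a sketch, not production code).
    """
    cols = ", ".join(columns)
    query = (
        f"SELECT {cols}, COUNT(*) "   # what was duplicated and how many times
        f"FROM {table} "
        f"GROUP BY {cols} "           # step 1: group by the target column(s)
        f"HAVING COUNT(*) > 1 "       # step 2: keep groups with more than one row
        f"ORDER BY {cols}"
    )
    return conn.execute(query).fetchall()


# Hypothetical sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Subscribers (SubscriberID INTEGER, Email TEXT)")
conn.executemany(
    "INSERT INTO Subscribers VALUES (?, ?)",
    [(1, "a@example.com"), (2, "b@example.com"), (3, "a@example.com")],
)

dupes = find_duplicates(conn, "Subscribers", ["Email"])
print(dupes)  # [('a@example.com', 2)]
```

Passing more than one column name, for example ["OrderID", "ProductID"], checks for duplicated combinations rather than duplicated single values.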
For a quick visual refresher on GROUP BY, check out our SQL GROUP BY video from the We Learn SQL series. Our SQL Practice Set course offers over 80 hands-on SQL exercises for practicing these concepts in detail.
Duplicate values in a column
Here, we will demonstrate how you can find duplicate values in a single column. For this example, we’ll use the Orders table, a modified version of the table we used in my previous article on using GROUP BY in SQL. Below is a sample of the table.
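OrderID | CustomerID | EmployeeID | OrderDate | ShipperID
10248 | 90 | 5 | 1996-07-04 | 3
10249 | 81 | 6 | 1996-07-05 | 1
10250 | 34 | 4 | 1996-07-08 | 2
10251 | 84 | 3 | 1996-07-08 | 1
10251 | 84 | 3 | 1996-07-08 | 1
10252 | 76 | 4 | 1996-07-09 | 2
… | … | … | … | …
10443 | 66 | 8 | 1997-02-12 | 1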
In this example, there are some duplicates in the OrderID column. Ideally, each row should have a unique value for OrderID, as each individual order is assigned its own value. For some reason, that wasn’t implemented here. To find the duplicates, we can use the following query:
SELECT OrderID, COUNT(OrderID)
FROM Orders
GROUP BY OrderID
HAVING COUNT(OrderID) > 1;
RESULT
Number of records: 2
OrderID | COUNT(OrderID)
10251 | 2
10276 | 2
As we can see, OrderID 10251 (which we saw in the sample of the table above) and OrderID 10276 have duplicates.
The use of the GROUP BY and HAVING clauses can clearly show duplicates in your data. Once you have validated that the rows are the same, you can choose to remove duplicates by using the DELETE statement.
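One possible way to validate and then remove such duplicates is sketched below, again using Python’s built-in sqlite3 module. It relies on SQLite’s implicit rowid column to tell otherwise identical rows apart; that is an assumption about the database engine, and other systems need a different technique. The four rows are taken from the Orders sample above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders (OrderID INTEGER, CustomerID INTEGER, "
    "EmployeeID INTEGER, OrderDate TEXT, ShipperID INTEGER)"
)
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?, ?, ?)",
    [
        (10250, 34, 4, "1996-07-08", 2),
        (10251, 84, 3, "1996-07-08", 1),
        (10251, 84, 3, "1996-07-08", 1),  # exact duplicate of the row above
        (10252, 76, 4, "1996-07-09", 2),
    ],
)

# Validate first: pull the full rows behind every duplicated OrderID
# so we can confirm they really are identical.
suspects = conn.execute(
    """
    SELECT * FROM Orders
    WHERE OrderID IN (
        SELECT OrderID FROM Orders GROUP BY OrderID HAVING COUNT(*) > 1
    )
    """
).fetchall()
print(suspects)  # two identical rows for OrderID 10251

# Then delete: keep the first physical copy (lowest rowid) of each fully
# identical row and remove the rest. Grouping by every column means rows
# that share an OrderID but differ elsewhere are left untouched.
conn.execute(
    """
    DELETE FROM Orders
    WHERE rowid NOT IN (
        SELECT MIN(rowid)
        FROM Orders
        GROUP BY OrderID, CustomerID, EmployeeID, OrderDate, ShipperID
    )
    """
)
conn.commit()

remaining = conn.execute("SELECT OrderID FROM Orders ORDER BY OrderID").fetchall()
print(remaining)  # [(10250,), (10251,), (10252,)]
```

On real data, it is worth running the DELETE inside a transaction, or against a copy of the table, until you are sure it removes only what you expect.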
Duplicate values in multiple columns
You are often interested in finding rows in which a combination of a few columns matches. For this example, we’ll use the OrderDetails table, a sample of which is shown below.
OrderDetailID | OrderID | ProductID | Quantity
1 | 10248 | 11 | 12
2 | 10248 | 42 | 10
3 | 10248 | 72 | 5
4 | 10249 | 14 | 9
5 | 10249 | 14 | 2
6 | 10249 | 51 | 40
… | … | … | …
520 | 10443 | 28 | 12
We want to find entries where the combination of OrderID and ProductID appears more than once. This type of duplicate probably means that there is an error in the ordering system, as each product should appear only once in a given order. If multiple units of a product are ordered, the Quantity value should simply increase; separate (duplicate) rows should not be created. Such a failure can adversely affect business operations if orders are fulfilled, packaged, and shipped automatically.
To find duplicates across multiple columns, we can use the following query. It is very similar to the single-column query:
SELECT OrderID, ProductID, COUNT(*)
FROM OrderDetails
GROUP BY OrderID, ProductID
HAVING COUNT(*) > 1;
RESULT
Number of records: 2
Above, we can confirm that the ordering system has an error. Like the first example that uses a single column, this second example allows us to find errors in the ordering system. In this case, the products are registered as a new order even though they have been added to the same cart by the same customer. Now you, as the business owner, can take appropriate corrective action to rectify this error in your order management system.
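To try the multi-column check on a small scale, here is a minimal sketch using Python’s built-in sqlite3 module. It loads only the first six sample rows of OrderDetails shown earlier, so it finds one duplicated pair rather than the two reported for the full table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE OrderDetails (OrderDetailID INTEGER PRIMARY KEY, "
    "OrderID INTEGER, ProductID INTEGER, Quantity INTEGER)"
)
# Only the six sample rows from the article, not the full table.
conn.executemany(
    "INSERT INTO OrderDetails VALUES (?, ?, ?, ?)",
    [
        (1, 10248, 11, 12),
        (2, 10248, 42, 10),
        (3, 10248, 72, 5),
        (4, 10249, 14, 9),
        (5, 10249, 14, 2),   # same OrderID and ProductID as row 4
        (6, 10249, 51, 40),
    ],
)

# Group by the combination of both columns; a group with more than
# one row means the same product was recorded twice for one order.
duplicates = conn.execute(
    """
    SELECT OrderID, ProductID, COUNT(*)
    FROM OrderDetails
    GROUP BY OrderID, ProductID
    HAVING COUNT(*) > 1
    """
).fetchall()
print(duplicates)  # [(10249, 14, 2)]
```

Note that OrderDetailID is unique for every row here, which is why grouping by the business columns (OrderID, ProductID) rather than the key is what exposes the problem.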
Note that in the query above, we used COUNT(*) and not a column-specific count like COUNT(OrderID). COUNT(*) counts all rows, while COUNT(column) counts only non-null values in the specified column. In this example, however, it would not have made any difference: there were no null values in either of the two columns we grouped by.
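The difference does matter once nulls are present. The following sketch, using Python’s built-in sqlite3 module, shows it on a hypothetical Shipments table with a nullable TrackingCode column; neither the table nor its data is part of the order database used in this article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: two rows share a code, two rows have no code at all.
conn.execute("CREATE TABLE Shipments (TrackingCode TEXT)")
conn.executemany(
    "INSERT INTO Shipments VALUES (?)",
    [("A1",), ("A1",), (None,), (None,)],
)

# COUNT(*) counts every row in the group, including rows where
# TrackingCode is NULL, so the NULL group is reported as a duplicate.
with_star = conn.execute(
    "SELECT TrackingCode FROM Shipments "
    "GROUP BY TrackingCode HAVING COUNT(*) > 1 ORDER BY TrackingCode"
).fetchall()

# COUNT(TrackingCode) skips NULLs, so the NULL group counts as 0
# and is filtered out by the HAVING clause.
with_column = conn.execute(
    "SELECT TrackingCode FROM Shipments "
    "GROUP BY TrackingCode HAVING COUNT(TrackingCode) > 1 ORDER BY TrackingCode"
).fetchall()

print(with_star)    # [(None,), ('A1',)]
print(with_column)  # [('A1',)]
```

Which behavior you want depends on whether several rows with a missing value should count as duplicates of each other in your dataset.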
Dealing with duplicate values
Finding duplicates in SQL is primarily about quality/sanity checks and data validation. These checks are often part of the day-to-day operations of many small and medium-sized businesses.
Also, this is a very common interview question for data science/analyst roles! So, it’s great that you now know the basics of how to approach it. Still, you’ll need more practice to see the nuances that each dataset presents and to decide which criteria to apply for those sanity and quality checks.
To have a better handle on duplicate records, I would definitely recommend LearnSQL’s SQL Basics course, which covers these concepts holistically, with a full set of hands-on exercises.