Delete duplicate rows or records in SQL | SQL Interview Question
Deleting duplicate rows or records is a popular SQL interview questions but many find it difficult to solve. Let’s try to break it down in this lesson,
Deleting duplicate records involves the below steps:
Step 1: Check if the table contains duplicate values as per your criteria
Step 2: Write the Select
query to find the duplicates records
Step 3: Delete
the duplicates records found in Step 2
Now let’s explore these steps in detail,
Step 1: Check if the table contains duplicate values as per your criteria
In our example, we have a course table, that has the below records. As we can see, we have many duplicate records (marked in yellow) in this table.
Now let’s validate the same using a SQL query.
Here GROUP BY
is a good strategy to fetches all unique combinations of course_name
, price
, and instructor_id
and counts how many instances (rows) have the same combination.
- The below SQL query is used to identify duplicates in the
course
table based on specific columns (course_name
,price
, andinstructor_id
). - Then, we add
HAVING COUNT(*) > 1
which filters to show only those combinations that appear more than once, which indicates the presence of duplicates in the table.
SELECT course_name, price, instructor_id, count(course_id)
FROM course
GROUP BY course_name, price, instructor_id
HAVING COUNT(*) > 1
The output confirms that the below courses are repeated 3 times.
Now that we know, that we have duplicate records in the course
table, let’s explore how we can remove these duplicate records.
In this article we will discuss 3 commonly used methods,
Method 1: Using GROUP BY and HAVING clause
Method 2: Using ROW_NUMBER() window function
Method 3: Using Self Join
Method 1: Using GROUP BY and HAVING clause
This query retrieves the minimum course_id
for each unique combination of course_name
, price
, and instructor_id
where there are duplicates. Here's how it works:
GROUP BY course_name, price, instructor_id
: Groups the records by these columns, creating groups of records that have the same combination of these values.HAVING COUNT(*) > 1
: Filters the groups to only include those where the count of rows within each group (i.e., the count of duplicates based on the specified columns) is greater than 1.SELECT MIN(course_id)
: Within each of these groups, it selects the minimumcourse_id
. This gives you the minimumcourse_id
for each set of duplicates based oncourse_name
,price
, andinstructor_id
.
SELECT MIN(course_id)
FROM course
GROUP BY course_name, price, instructor_id
HAVING COUNT(*) > 1
Output:
We can use the first query as a subquery to delete duplicate records:
- The subquery inside
NOT IN
provides a list of minimumcourse_id
s for each group of duplicates based oncourse_name
,price
, andinstructor_id
. - The outer DELETE statement removes all rows from the
course
table where thecourse_id
is not present in the list of minimumcourse_id
s for groups that have duplicates based on the specified columns.
DELETE FROM course
WHERE course_id NOT IN (
SELECT MIN(course_id)
FROM course
GROUP BY course_name, price, instructor_id
HAVING COUNT(*) > 1
);
Method 2: Using ROW_NUMBER() window function
- Here we Use a window function
ROW_NUMBER()
within a subquery to generate row numbers partitioned bycourse_name
,price
, andinstructor_id
. - The inner subquery assigns a sequential number (
row_num
) to each row within its respective partition. - The outer query selects
course_id
s whererow_num
is greater than 1, which indicates the duplicates for the specified combination of columns.
SELECT course_id
FROM (
SELECT course_id,
ROW_NUMBER() OVER(PARTITION BY course_name, price, instructor_id ORDER BY course_id) AS row_num
FROM course
) AS subquery
WHERE row_num > 1
Output:
DELETE FROM course WHERE course_id IN ( SELECT course_id FROM ( SELECT course_id,
ROW_NUMBER() OVER(PARTITION BY course_name, price, instructor_id ORDER BY course_id)
AS row_num FROM course ) AS subquery WHERE row_num > 1 );
If you want to avoid subquery, we can also use Common Table Expression (CTE) along with window function as shown below,
- First, create a Common Table Expression (CTE)
CTE
that usesROW_NUMBER()
to assign sequential numbers partitioned bycourse_name
,price
, andinstructor_id
. - Then, use DELETE statement to delete rows from the CTE where
row_num
is greater than 1 to remove the duplicates.
WITH CTE AS (
SELECT course_id, course_name, price, instructor_id,
ROW_NUMBER() OVER(PARTITION BY course_name, price, instructor_id ORDER BY course_id) AS row_num
FROM course
)
DELETE FROM CTE WHERE row_num > 1;
Method 3: Using Self Join
Here we use a self-join on the course
table (c1
and c2
are aliases for the same table) to identify duplicate records based on certain criteria (course_name
, price
, and instructor_id
).
Here's how it works:
JOIN course c2 ON...
: Joins thecourse
table to itself (c2
) based on matchingcourse_name
,price
, andinstructor_id
, ensuring thatc1
refers to a record with a highercourse_id
thanc2
.AND c1.course_id > c2.course_id
: This condition ensures that only the record with the highercourse_id
within each pair of duplicates is selected.SELECT c1.course_id
: It retrieves thecourse_id
fromc1
, which represents the duplicates with highercourse_id
values.
SELECT c1.course_id
FROM course c1
JOIN course c2 ON c1.course_name = c2.course_name
AND c1.price = c2.price
AND c1.instructor_id = c2.instructor_id
AND c1.course_id > c2.course_id
Output:
We then pass the previous course_id list as a subquery to delete duplicate records from the course table,
DELETE FROM course
WHERE course_id IN (
SELECT c1.course_id
FROM course c1
JOIN course c2 ON c1.course_name = c2.course_name
AND c1.price = c2.price
AND c1.instructor_id = c2.instructor_id
AND c1.course_id > c2.course_id
);