Misc Topics
Delete duplicate rows or records in SQL | SQL Interview Question

Delete duplicate rows or records in SQL | SQL Interview Question

Deleting duplicate rows or records is a popular SQL interview questions but many find it difficult to solve. Let’s try to break it down in this lesson,

Deleting duplicate records involves the below steps:

Step 1: Check if the table contains duplicate values as per your criteria

Step 2: Write the Select query to find the duplicates records

Step 3: Delete the duplicates records found in Step 2

Now let’s explore these steps in detail,

Step 1: Check if the table contains duplicate values as per your criteria

In our example, we have a course table, that has the below records. As we can see, we have many duplicate records (marked in yellow) in this table.

Course table with many duplicate records

Now let’s validate the same using a SQL query.

💡

Here GROUP BY is a good strategy to fetches all unique combinations of course_name, price, and instructor_id and counts how many instances (rows) have the same combination.

  • The below SQL query is used to identify duplicates in the course table based on specific columns (course_name, price, and instructor_id).
  • Then, we addHAVING COUNT(*) > 1 which filters to show only those combinations that appear more than once, which indicates the presence of duplicates in the table.
SELECT course_name, price, instructor_id, count(course_id)
FROM course
GROUP BY course_name, price, instructor_id
HAVING COUNT(*) > 1

The output confirms that the below courses are repeated 3 times.

Course table with many duplicate records

Now that we know, that we have duplicate records in the course table, let’s explore how we can remove these duplicate records.

In this article we will discuss 3 commonly used methods,

Method 1: Using GROUP BY and HAVING clause

Method 2: Using ROW_NUMBER() window function

Method 3: Using Self Join

Method 1: Using GROUP BY and HAVING clause

This query retrieves the minimum course_id for each unique combination of course_name, price, and instructor_id where there are duplicates. Here's how it works:

  • GROUP BY course_name, price, instructor_id: Groups the records by these columns, creating groups of records that have the same combination of these values.
  • HAVING COUNT(*) > 1: Filters the groups to only include those where the count of rows within each group (i.e., the count of duplicates based on the specified columns) is greater than 1.
  • SELECT MIN(course_id): Within each of these groups, it selects the minimum course_id. This gives you the minimum course_id for each set of duplicates based on course_name, price, and instructor_id.
SELECT MIN(course_id)
FROM course
GROUP BY course_name, price, instructor_id
HAVING COUNT(*) > 1

Output:

Method 1 output

We can use the first query as a subquery to delete duplicate records:

  • The subquery inside NOT IN provides a list of minimum course_ids for each group of duplicates based on course_name, price, and instructor_id.
  • The outer DELETE statement removes all rows from the course table where the course_id is not present in the list of minimum course_ids for groups that have duplicates based on the specified columns.
DELETE FROM course
WHERE course_id NOT IN (
SELECT MIN(course_id)
FROM course
GROUP BY course_name, price, instructor_id
HAVING COUNT(*) > 1
);

Method 2: Using ROW_NUMBER() window function

  • Here we Use a window function ROW_NUMBER() within a subquery to generate row numbers partitioned by course_name, price, and instructor_id.
  • The inner subquery assigns a sequential number (row_num) to each row within its respective partition.
  • The outer query selects course_ids where row_num is greater than 1, which indicates the duplicates for the specified combination of columns.
SELECT course_id
FROM (
SELECT course_id,
ROW_NUMBER() OVER(PARTITION BY course_name, price, instructor_id ORDER BY course_id) AS row_num
FROM course
) AS subquery
WHERE row_num > 1

Output:

Method 2 output

DELETE FROM course WHERE course_id IN ( SELECT course_id FROM ( SELECT course_id,
ROW_NUMBER() OVER(PARTITION BY course_name, price, instructor_id ORDER BY course_id)
AS row_num FROM course ) AS subquery WHERE row_num > 1 );

If you want to avoid subquery, we can also use Common Table Expression (CTE) along with window function as shown below,

  • First, create a Common Table Expression (CTE) CTE that uses ROW_NUMBER() to assign sequential numbers partitioned by course_name, price, and instructor_id.
  • Then, use DELETE statement to delete rows from the CTE where row_num is greater than 1 to remove the duplicates.
WITH CTE AS (
SELECT course_id, course_name, price, instructor_id,
ROW_NUMBER() OVER(PARTITION BY course_name, price, instructor_id ORDER BY course_id) AS row_num
FROM course
)
 
DELETE FROM CTE WHERE row_num > 1;

Method 3: Using Self Join

Here we use a self-join on the course table (c1 and c2 are aliases for the same table) to identify duplicate records based on certain criteria (course_name, price, and instructor_id).

Here's how it works:

  • JOIN course c2 ON...: Joins the course table to itself (c2) based on matching course_name, price, and instructor_id, ensuring that c1 refers to a record with a higher course_id than c2.
  • AND c1.course_id > c2.course_id: This condition ensures that only the record with the higher course_id within each pair of duplicates is selected.
  • SELECT c1.course_id: It retrieves the course_id from c1, which represents the duplicates with higher course_id values.
SELECT c1.course_id
FROM course c1
JOIN course c2 ON c1.course_name = c2.course_name
AND c1.price = c2.price
AND c1.instructor_id = c2.instructor_id
AND c1.course_id > c2.course_id

Output:

Method 3 output

We then pass the previous course_id list as a subquery to delete duplicate records from the course table,

DELETE FROM course
WHERE course_id IN (
SELECT c1.course_id
FROM course c1
JOIN course c2 ON c1.course_name = c2.course_name
AND c1.price = c2.price
AND c1.instructor_id = c2.instructor_id
AND c1.course_id > c2.course_id
);