How to Delete Duplicate Rows in MySQL

MySQL tables sometimes end up with duplicate records. Removing them is a routine data cleaning step in analysis and reporting, because duplicates distort aggregations and summaries. In this article, we will learn several ways to find and delete duplicate rows in MySQL. Since most of the queries are plain SQL, you can adapt most of these solutions to other databases as well.

Why Delete Duplicate Rows in MySQL

Duplicate data causes many problems in data analysis and reporting systems. It inflates row counts and totals, and skews aggregations and summaries. It also undermines data integrity and the credibility of your reports. So it is important to identify and remove duplicate rows from your MySQL database before you process the data any further. This is usually done during the data preparation phase of analysis, where we identify and remove records containing duplicates, nulls, empty values, incorrect data, etc. Thereafter, it is essential to put database constraints in place to prevent duplicate rows from being inserted again.

How to Delete Duplicate Rows in MySQL

Here are the most common ways to delete duplicate rows in MySQL. Let us say you have the following sales table.

mysql> create table sales(id int, product varchar(255),orders int);

mysql> insert into sales(id,product,orders) values
(1,'A',50),
(2,'B',60),
(3,'A',70),
(4,'B',40),
(5,'C',50),
(1,'A',50),
(4,'B',40);

mysql> select * from sales;
+------+---------+--------+
| id   | product | orders |
+------+---------+--------+
|    1 | A       |     50 |
|    2 | B       |     60 |
|    3 | A       |     70 |
|    4 | B       |     40 |
|    5 | C       |     50 |
|    1 | A       |     50 |
|    4 | B       |     40 |
+------+---------+--------+

1. Using Row_Number

In MySQL 8.0 and later, you can use the ROW_NUMBER() window function within a common table expression (CTE). ROW_NUMBER() assigns a sequential number to each row within a partition, so every row after the first in a partition of duplicates gets a number greater than 1. Here is an example to find duplicate rows in a MySQL table using ROW_NUMBER().

WITH cte AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY id, product, orders
            ORDER BY id
        ) AS row_num
    FROM
        sales
)
SELECT * FROM cte WHERE row_num > 1;

In the above query, MySQL partitions the table by unique combinations of id, product and orders column values. Within each partition, it assigns sequential row numbers starting from 1. The result is available as the common table expression cte. If a partition contains duplicate rows, every row after the first has a row number greater than 1. We select such rows from the common table expression.

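Note that you cannot simply replace the SELECT * FROM … query with a DELETE on the CTE. MySQL rejects a statement such as the following, because a CTE is not an updatable target.

delete FROM cte WHERE row_num > 1;

To delete with ROW_NUMBER(), the DELETE has to target the sales table itself, which requires a column that tells the copies apart, such as a unique id. When rows are identical in every column, as in our sample table, use the intermediate table method described next.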
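Since MySQL does not accept a CTE as the target of a DELETE, one possible approach in MySQL 8.0+ is to join the base table to the CTE in a multi-table DELETE. The sketch below assumes a variant of the sales table in which id is unique and duplicates occur only in the product and orders columns; it does not apply to rows that are identical in every column.

```sql
-- Assumption: id is unique, and duplicates are rows that share
-- the same product and orders values.
WITH cte AS (
    SELECT
        id,
        ROW_NUMBER() OVER (
            PARTITION BY product, orders
            ORDER BY id
        ) AS row_num
    FROM sales
)
-- Keep the row with the lowest id in each partition and delete the rest.
DELETE s
FROM sales s
INNER JOIN cte ON s.id = cte.id
WHERE cte.row_num > 1;
```

Running the SELECT version of the CTE first is a cheap way to review which rows would be removed.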

2. Using Intermediate Table

This solution is suitable for large tables that contain many duplicates. A DELETE statement has to locate and remove the duplicate rows in place, which can be time consuming as the table grows. In such cases, it is often better to create an intermediate table and copy only the unique rows from the original table into it. Then we drop the original table and rename the intermediate table to the original name. Lastly, we re-create any relations that were present in the original table.

Here is the query to create an empty temp_sales table that has the same structure as the sales table.

create table temp_sales like sales;

Next, we insert into temp_sales unique rows from sales table.

insert into temp_sales select distinct * from sales;

Next, we drop sales table.

drop table sales;

Lastly, we rename temp_sales table to sales.

rename table temp_sales to sales;

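CREATE TABLE … LIKE copies the column definitions and indexes of the original table, so you do not need to recreate indexes. It does not copy foreign key definitions, and dropping the original table also drops its triggers, so recreate those on the new table if the original had any.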
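The drop and rename steps above leave a brief window in which no sales table exists. A possible variation, sketched below, swaps both names in a single RENAME TABLE statement, which MySQL performs atomically, and drops the old data only afterwards. The table names sales_dedup and sales_old are illustrative and not part of the original example.

```sql
-- Build the deduplicated copy (sales_dedup is an illustrative name).
CREATE TABLE sales_dedup LIKE sales;
INSERT INTO sales_dedup SELECT DISTINCT * FROM sales;

-- Swap both names in one statement, so sales is never missing.
RENAME TABLE sales TO sales_old, sales_dedup TO sales;

-- Drop the old data once the new table has been checked.
DROP TABLE sales_old;
```

Keeping sales_old around until the row counts have been verified also gives you a simple way back if something looks wrong.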

3. Using DISTINCT Keyword

You can use the DISTINCT keyword to return only unique rows in query result. Here is a sample query to get only distinct rows.

mysql> select distinct * from sales;
+------+---------+--------+
| id   | product | orders |
+------+---------+--------+
|    1 | A       |     50 |
|    2 | B       |     60 |
|    3 | A       |     70 |
|    4 | B       |     40 |
|    5 | C       |     50 |
+------+---------+--------+

Please note, this will not alter the original table. Only the query result will contain unique rows.

4. Using Group By Clause

The GROUP BY clause is generally used to aggregate data by grouping rows on one or more columns. If you list every column of the table in the GROUP BY clause and use no aggregate functions like sum(), min(), max(), etc., the query returns one row for each distinct combination of values, that is, only unique rows.

mysql> select * from sales group by id, product, orders;
+------+---------+--------+
| id   | product | orders |
+------+---------+--------+
|    1 | A       |     50 |
|    2 | B       |     60 |
|    3 | A       |     70 |
|    4 | B       |     40 |
|    5 | C       |     50 |
+------+---------+--------+

Please note, this method also does not alter the original table.

5. Using Having Clause

The HAVING clause is generally used to filter the result of GROUP BY clause. Here is an example to select only those rows which have duplicates, that is, the row count in GROUP BY result >1.

mysql> select id, product, orders, count(*) 
from sales
group by id, product, orders
having count(*)>1;
+------+---------+--------+----------+
| id   | product | orders | count(*) |
+------+---------+--------+----------+
|    1 | A       |     50 |        2 |
|    4 | B       |     40 |        2 |
+------+---------+--------+----------+
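Before deleting anything, it can help to know how many surplus rows a cleanup should remove. The following sketch wraps the HAVING query in a derived table and adds up the extra copies in each group; the aliases cnt, dup_groups and surplus_rows are illustrative.

```sql
-- Each duplicated group contributes (count - 1) surplus rows.
-- COALESCE turns the NULL that SUM returns for "no duplicates" into 0.
SELECT COALESCE(SUM(cnt - 1), 0) AS surplus_rows
FROM (
    SELECT COUNT(*) AS cnt
    FROM sales
    GROUP BY id, product, orders
    HAVING COUNT(*) > 1
) AS dup_groups;
```

For the sample table shown at the start, the two duplicated groups each contain one extra copy, so the query reports 2.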

6. Using Self-Join & Delete

If your table has a unique primary key column but duplicate values in the other columns, you can do a self-join to identify the duplicate rows. Let us say you have the following sales table with id column as primary key, and duplicates in other columns.

create table sales(id int primary key, product varchar(255),orders int);

insert into sales(id,product,orders) values
(1,'A',50),
(2,'B',60),
(3,'A',70),
(4,'B',40),
(5,'C',50),
(6,'A',50),
(7,'B',40);

You can easily filter the duplicate rows using the following inner join query.

select * FROM sales t1
INNER JOIN sales t2
WHERE
t1.id > t2.id AND
t1.product = t2.product AND
t1.orders = t2.orders;

+------+---------+--------+------+---------+--------+
| id   | product | orders | id   | product | orders |
+------+---------+--------+------+---------+--------+
|    6 | A       |     50 |    1 | A       |     50 |
|    7 | B       |     40 |    4 | B       |     40 |
+------+---------+--------+------+---------+--------+

In the above query, we join the sales table to itself and look for pairs of rows that have equal product and orders values. Of each pair, we keep only the combination where the id (primary key) of the left table is greater than that of the right table, so the left side lists the later copies and the row with the lowest id in each group never appears there.
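To actually remove the duplicates found by the self-join, the same join condition can be used in a multi-table DELETE. This is a minimal sketch that assumes id is unique and that product and orders contain no NULL values, since the = operator never matches NULL (the NULL-safe operator <=> would be needed in that case).

```sql
-- Delete every row for which an earlier row (lower id) exists
-- with the same product and orders values.
DELETE t1
FROM sales t1
INNER JOIN sales t2
    ON  t1.product = t2.product
    AND t1.orders  = t2.orders
    AND t1.id      > t2.id;
```

With the sample data, this removes the rows with id 6 and 7 and keeps ids 1 to 5. Swapping the comparison to t1.id < t2.id would keep the row with the highest id in each group instead.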

Avoid Inserting Duplicate Records

Once you have identified and removed duplicate rows from your data, it is essential to add a UNIQUE constraint or primary key to prevent insertion of duplicate rows in future. You can do this using the following SQL query on the sales table.

ALTER TABLE sales ADD UNIQUE (id);

In the above example, we have set the id column to be unique. You can add a unique constraint on one or more columns.

You can even name the constraint if you want.

ALTER TABLE sales ADD CONSTRAINT uniq_id UNIQUE (id);
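When a duplicate is defined by a combination of columns, the constraint can cover all of them. The sketch below assumes that a repeated product and orders pair counts as a duplicate and that such pairs have already been removed from the table; the constraint name uniq_product_orders and the inserted values are illustrative.

```sql
-- Treat each (product, orders) pair as unique.
-- This statement fails if the table still contains duplicate pairs.
ALTER TABLE sales
    ADD CONSTRAINT uniq_product_orders UNIQUE (product, orders);

-- With the constraint in place, INSERT IGNORE skips a row that would
-- violate it and raises a warning instead of an error.
INSERT IGNORE INTO sales (id, product, orders) VALUES (8, 'A', 50);
```

A plain INSERT of the same row would stop with a duplicate-key error, which is often preferable when silent skipping could hide a problem in the data feed.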

Conclusion

It is a common requirement to remove duplicate records from a MySQL table, especially in data analysis and reporting systems where you need to clean up data before analyzing it. In this article, we have learnt several ways to find and remove duplicate records from a table in MySQL. If you are working with a small or medium-sized table, you can use any of the above solutions, but if you are working with large tables then it is recommended to use an intermediate table (solution #2). Otherwise, the queries may take a long time to run.
