Last updated on May 20th, 2025 at 07:10 am
Python Pandas is commonly used to store and analyze data. Duplicate rows or column values are a common problem faced during data analysis, so Python developers need to find and remove duplicates from their data. You can easily do this using the duplicated() function, available on every Pandas dataframe. In this article, we will learn how to find duplicates in a Python Pandas dataframe.
Why Find Duplicates in Python Pandas Dataframe
When we receive data from another source, it may already contain entirely duplicate rows of information. Sometimes, only certain columns may contain duplicate values. Data entry errors can also result in the presence of duplicates in data. If you happen to merge or join dataframes or datasets, then the result may contain duplicates. So Python developers need to identify these duplicate values so that they can modify or remove them from their dataframe. This is an essential aspect of data preparation.
How to Find Duplicates in Python Pandas Dataframe
There are several simple ways to find duplicates in Python Pandas. Every Pandas dataframe supports the duplicated() function, which lets you identify duplicates for different use cases. Here is its syntax.
DataFrame.duplicated(subset=None, keep='first')
The duplicated() function finds and marks all duplicate rows with a Boolean flag indicating whether each row is a duplicate or not. It accepts two arguments – subset and keep. Both are optional. By default, the duplicated() function identifies duplicates based on all columns. If you want to flag rows that contain duplicate values only for certain columns, you can specify them in the subset argument. The keep argument determines which duplicate row is not flagged: 'first' (the default) keeps the first occurrence, 'last' keeps the last occurrence, and False flags every occurrence as a duplicate.
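The effect of each keep value can be seen in a minimal, self-contained sketch. The single-column dataframe here is hypothetical, used only to illustrate the argument:

```python
import pandas as pd

# A hypothetical dataframe where the value 1 appears three times
df = pd.DataFrame({'Id': [1, 1, 1, 2]})

print(df.duplicated(keep='first').tolist())  # [False, True, True, False]
print(df.duplicated(keep='last').tolist())   # [True, True, False, False]
print(df.duplicated(keep=False).tolist())    # [True, True, True, False]
```

With keep='first', only the first occurrence escapes the duplicate flag; with keep='last', only the last; with keep=False, every occurrence in a group of duplicates is flagged.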
Let us say you have the following Pandas dataframe.
import pandas as pd
data = {'Name': ['John', 'Jane', 'John', 'Joe'],
        'Age': [28, 22, 28, 22],
        'City': ['New York', 'Paris', 'New York', 'London']}
df = pd.DataFrame(data)
print(df)
## output
   Name  Age      City
0  John   28  New York
1  Jane   22     Paris
2  John   28  New York
3   Joe   22    London
1. Find duplicates based on all columns
By default, calling the duplicated() function on a dataframe will flag duplicate rows. Here is an example.
duplicate = df.duplicated()
print(duplicate)
## output
0    False
1    False
2     True
3    False
dtype: bool
In the above output, the duplicated() function returns a Boolean (True/False) value for each row. True means the row is a duplicate, and False means it is not. If you want the actual duplicate rows, you need to pass the result of the duplicated() function back to the dataframe as a Boolean index, as shown.
duplicate = df[df.duplicated()]
print(duplicate)
## output
   Name  Age      City
2  John   28  New York
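Since the Boolean result counts True as 1, you can also count the duplicate rows directly by summing it. A small sketch using the same data:

```python
import pandas as pd

data = {'Name': ['John', 'Jane', 'John', 'Joe'],
        'Age': [28, 22, 28, 22],
        'City': ['New York', 'Paris', 'New York', 'London']}
df = pd.DataFrame(data)

# Summing the Boolean Series gives the number of flagged duplicate rows
num_duplicates = df.duplicated().sum()
print(num_duplicates)  # 1
```

This is a quick sanity check before deciding how to handle the duplicates.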
2. Find duplicates based on single column
Sometimes, only a single column of the dataframe may contain duplicate values. Or you may want to identify only those rows that contain duplicate values for a specific column. For this purpose, you need to specify the column name as the subset argument. Here is an example to find rows with duplicate values for the ‘Age’ column.
duplicate = df[df.duplicated(subset='Age')]
OR
duplicate = df[df.duplicated('Age')]
print(duplicate)
## output
   Name  Age      City
2  John   28  New York
3   Joe   22    London
3. Find duplicates based on multiple columns
Similarly, sometimes you may need to find duplicates based on multiple columns. In this case, you need to pass a list of column names as the subset argument.
duplicate = df[df.duplicated(subset=['Name','Age'])]
OR
duplicate = df[df.duplicated(['Name','Age'])]
print(duplicate)
## output
   Name  Age      City
2  John   28  New York
4. Get duplicate last rows
In each of the above use cases, the duplicated() function keeps the first row of each group of duplicates and flags all the other rows as duplicates. What if you want to retain the last duplicate row instead? In this case, we can use the keep argument. Here is an example where we find duplicates for the ‘Age’ column but retain (that is, do not flag) the last row in each group of duplicates.
duplicate = df[df.duplicated(subset='Age',keep='last')]
print(duplicate)
## output
   Name  Age      City
0  John   28  New York
1  Jane   22     Paris
And here is an example where we retain the first row of each group of duplicates, which is the default behavior.
duplicate = df[df.duplicated(subset='Age',keep='first')]
print(duplicate)
## output
   Name  Age      City
2  John   28  New York
3   Joe   22    London
As you can see, the output is different in both cases, even though we find duplicates based on the same column ‘Age’.
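The keep argument also accepts False, which flags every row in a group of duplicates, including the first and last occurrences. A minimal sketch using the same data:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Jane', 'John', 'Joe'],
                   'Age': [28, 22, 28, 22],
                   'City': ['New York', 'Paris', 'New York', 'London']})

# keep=False flags every occurrence in each group of duplicates,
# so all four rows are returned (each Age value appears twice)
duplicate = df[df.duplicated(subset='Age', keep=False)]
print(duplicate)
```

This is handy when you want to inspect every member of each duplicate group side by side.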
5. Get duplicates in sorted order
Sometimes, the duplicate column values or rows may be scattered all over your dataframe and not present in sequential order. In such cases, you can sort the duplicate values, making the output easier to understand. You can sort the dataframe before or after finding duplicates. We will look at both these approaches.
duplicate = df[df.duplicated(['Name', 'Age'], keep=False)].sort_values('Age')
print(duplicate)
## output
   Name  Age      City
0  John   28  New York
2  John   28  New York
In the above example, we sort the output of duplicated() function. Here is an example where we sort the dataframe before finding its duplicates.
sorted_df = df.sort_values(by=['Age'])
duplicate = sorted_df[sorted_df.duplicated(['Name', 'Age'], keep=False)]
print(duplicate)
## output
   Name  Age      City
0  John   28  New York
2  John   28  New York
After you have removed duplicates from your dataframe, you can also export it to an Excel spreadsheet.
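If you want to remove duplicates rather than just inspect them, Pandas provides the drop_duplicates() function, which accepts the same subset and keep arguments as duplicated(). A minimal sketch using the same data:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Jane', 'John', 'Joe'],
                   'Age': [28, 22, 28, 22],
                   'City': ['New York', 'Paris', 'New York', 'London']})

# drop_duplicates() keeps the first occurrence in each group of
# duplicates by default, mirroring duplicated(keep='first')
deduped = df.drop_duplicates()
print(deduped)  # rows 0, 1 and 3 remain
```

You can pass subset and keep here too, for example df.drop_duplicates(subset='Age', keep='last') to keep only the last row for each Age value.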
Conclusion
In this article, we have learnt several simple ways to find and extract the duplicate rows in a Pandas dataframe using the duplicated() function. It can be used to find completely duplicate rows, or rows with duplicate values for one or more columns. We have also learnt how to choose whether the first or last row in a set of duplicates is retained, and how to sort the result for easier analysis. Finding and extracting duplicate rows, or rows with duplicate column values, is a very common requirement in data preparation and cleansing. You can use any of these methods as per your requirement.
Also read:
How to Merge and Join Pandas Dataframe
How to Create Pivot Tables in Python Pandas
How to Connect Pandas to Database

Sreeram Sreenivasan is the Founder of Ubiq. He has helped many Fortune 500 companies in the areas of BI & software development.