Last updated on May 20th, 2025 at 07:10 am
Python Pandas is commonly used to store and analyze data. Duplicate rows or column values are a common problem faced during data analysis, so Python developers need to find and remove duplicates from their data. You can easily do this using the duplicated() function, available on every Pandas dataframe. In this article, we will learn how to find duplicates in a Python Pandas dataframe.
Why Find Duplicates in Python Pandas Dataframe
When we receive data from another source, it may already contain entirely duplicate rows of information. Sometimes, only certain columns may contain duplicate values. Data entry errors can also result in the presence of duplicates in data. If you happen to merge or join dataframes or datasets, then the result may contain duplicates. So Python developers need to identify these duplicate values so that they can modify or remove them from their dataframe. This is an essential aspect of data preparation.
How to Find Duplicates in Python Pandas Dataframe
There are several simple ways to find duplicates in Python Pandas. Every Pandas dataframe supports the duplicated() function, which lets you identify duplicates for different use cases. Here is its syntax.
DataFrame.duplicated(subset=None, keep='first')
The duplicated() function finds and marks all duplicate rows with a Boolean flag indicating whether each row is a duplicate or not. It accepts two arguments – subset and keep. Both are optional. By default, the duplicated() function identifies duplicates based on all columns. If you want to flag rows that contain duplicate values only for certain columns, you can specify them in the subset argument. The keep argument determines which duplicate row is not flagged: 'first' (the default) keeps the first occurrence, 'last' keeps the last occurrence, and False flags every occurrence as a duplicate.
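The effect of each keep value can be seen in a minimal, self-contained sketch. The single-column dataframe here is hypothetical, used only to illustrate the argument:

```python
import pandas as pd

# A hypothetical dataframe where the value 1 appears three times
df = pd.DataFrame({'Id': [1, 1, 1, 2]})

print(df.duplicated(keep='first').tolist())  # [False, True, True, False]
print(df.duplicated(keep='last').tolist())   # [True, True, False, False]
print(df.duplicated(keep=False).tolist())    # [True, True, True, False]
```

With keep='first', only the first occurrence escapes the duplicate flag; with keep='last', only the last; with keep=False, every occurrence in a group of duplicates is flagged.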
Let us say you have the following Pandas dataframe.
import pandas as pd
data = {'Name': ['John', 'Jane', 'John', 'Joe'],
        'Age': [28, 22, 28, 22],
        'City': ['New York', 'Paris', 'New York', 'London']}
df = pd.DataFrame(data)
print(df)
## output
   Name  Age      City
0  John   28  New York
1  Jane   22     Paris
2  John   28  New York
3   Joe   22    London
1. Find duplicates based on all columns
By default, calling the duplicated() function on a dataframe will flag duplicate rows. Here is an example.
duplicate = df.duplicated()
print(duplicate)
## output
0    False
1    False
2     True
3    False
dtype: bool
In the above output, the duplicated() function returns a Boolean (True/False) value for each row. True means the row is a duplicate, and False means it is not. If you want the actual duplicate rows, you need to pass the result of the duplicated() function back to the dataframe as a Boolean index, as shown.
duplicate = df[df.duplicated()]
print(duplicate)
## output
   Name  Age      City
2  John   28  New York
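Since the Boolean result counts True as 1, you can also count the duplicate rows directly by summing it. A small sketch using the same data:

```python
import pandas as pd

data = {'Name': ['John', 'Jane', 'John', 'Joe'],
        'Age': [28, 22, 28, 22],
        'City': ['New York', 'Paris', 'New York', 'London']}
df = pd.DataFrame(data)

# Summing the Boolean Series gives the number of flagged duplicate rows
num_duplicates = df.duplicated().sum()
print(num_duplicates)  # 1
```

This is a quick sanity check before deciding how to handle the duplicates.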
2. Find duplicates based on single column
Sometimes, only a single column of the dataframe may contain duplicate values. Or you may want to identify only those rows that contain duplicate values for a specific column. For this purpose, you need to specify the column name as the subset argument. Here is an example to find rows with duplicate values for the ‘Age’ column.
duplicate = df[df.duplicated(subset='Age')]
OR
duplicate = df[df.duplicated('Age')]
print(duplicate)
## output
   Name  Age      City
2  John   28  New York
3   Joe   22    London
3. Find duplicates based on multiple columns
Similarly, sometimes you may need to find duplicates based on multiple columns. In this case, you need to pass a list of column names as the subset argument.
duplicate = df[df.duplicated(subset=['Name','Age'])]
OR
duplicate = df[df.duplicated(['Name','Age'])]
print(duplicate)
## output
   Name  Age      City
2  John   28  New York
4. Get duplicate last rows
In each of the above use cases, the duplicated() function keeps the first row of each group of duplicates and flags all the other rows as duplicates. What if you want to retain the last duplicate row instead? In this case, we can use the keep argument. Here is an example where we find duplicates for the ‘Age’ column but retain (that is, do not flag) the last row in each group of duplicates.
duplicate = df[df.duplicated(subset='Age',keep='last')]
print(duplicate)
## output
   Name  Age      City
0  John   28  New York
1  Jane   22     Paris
And here is an example where we retain the first row of each group of duplicates, which is the default behavior.
duplicate = df[df.duplicated(subset='Age',keep='first')]
print(duplicate)
## output
   Name  Age      City
2  John   28  New York
3   Joe   22    London
As you can see, the output is different in both cases, even though we find duplicates based on the same column ‘Age’.
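The keep argument also accepts False, which flags every row in a group of duplicates, including the first and last occurrences. A minimal sketch using the same data:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Jane', 'John', 'Joe'],
                   'Age': [28, 22, 28, 22],
                   'City': ['New York', 'Paris', 'New York', 'London']})

# keep=False flags every occurrence in each group of duplicates,
# so all four rows are returned (each Age value appears twice)
duplicate = df[df.duplicated(subset='Age', keep=False)]
print(duplicate)
```

This is handy when you want to inspect every member of each duplicate group side by side.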
5. Get duplicates in sorted order
Sometimes, the duplicate column values or rows may be scattered all over your dataframe and not present in sequential order. In such cases, you can sort the duplicate values, making the output easier to understand. You can sort the dataframe before or after finding duplicates. We will look at both these approaches.
duplicate = df[df.duplicated(['Name', 'Age'], keep=False)].sort_values('Age')
print(duplicate)
## output
   Name  Age      City
0  John   28  New York
2  John   28  New York
In the above example, we sort the output of duplicated() function. Here is an example where we sort the dataframe before finding its duplicates.
sorted_df = df.sort_values(by=['Age'])
duplicate = sorted_df[sorted_df.duplicated(['Name', 'Age'], keep=False)]
print(duplicate)
## output
   Name  Age      City
0  John   28  New York
2  John   28  New York
After you have removed duplicates from your dataframe, you can also export it to an Excel spreadsheet.
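If you want to remove duplicates rather than just inspect them, Pandas provides the drop_duplicates() function, which accepts the same subset and keep arguments as duplicated(). A minimal sketch using the same data:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Jane', 'John', 'Joe'],
                   'Age': [28, 22, 28, 22],
                   'City': ['New York', 'Paris', 'New York', 'London']})

# drop_duplicates() keeps the first occurrence in each group of
# duplicates by default, mirroring duplicated(keep='first')
deduped = df.drop_duplicates()
print(deduped)  # rows 0, 1 and 3 remain
```

You can pass subset and keep here too, for example df.drop_duplicates(subset='Age', keep='last') to keep only the last row for each Age value.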
Conclusion
In this article, we have learnt several simple ways to find and extract the duplicate rows in a Pandas dataframe using the duplicated() function. It can be used to find completely duplicate rows, or rows with duplicate values for one or more columns. We have also learnt how to choose whether the first or last row in a set of duplicates is retained, and how to sort the result for easier analysis. Finding and extracting duplicate rows, or rows with duplicate column values, is a very common requirement in data preparation and cleansing. You can use any of these methods as per your requirement.
Also read:
How to Merge and Join Pandas Dataframe
How to Create Pivot Tables in Python Pandas
How to Connect Pandas to Database

Sreeram Sreenivasan is the Founder of Ubiq. He has helped many Fortune 500 companies in the areas of BI & software development.