Pandas is a powerful Python library that lets you store and analyze data in tabular form, as rows and columns. These tables are called dataframes, and they let you access, modify and filter data with little code. Sometimes you need to loop through the rows of a Pandas dataframe, and there are several ways to do this. In this article, we will learn how to iterate over rows in a Pandas dataframe.
How to Iterate Over Rows in Pandas DataFrame
Here are the common ways to iterate over dataframe rows. Let us say you have the following pandas dataframe.
import pandas as pd
data = {'ID': [1, 2, 3], 'Name': ['John', 'Jim', 'Joe'], 'Marks': [92, 95, 94]}
df = pd.DataFrame(data)
print(df)
Here is the output you will see.
   ID  Name  Marks
0   1  John     92
1   2   Jim     95
2   3   Joe     94
In the above dataframe, we have stored 3 rows, one for each student. Each row has 3 columns of data for ID, Name and Marks.
1. Using iterrows()
Let us say you want to calculate the total marks obtained by the 3 students. For this purpose, you will need to iterate over all rows in the table. Each dataframe object supports the iterrows() function, which lets you iterate over rows. It returns an iterator rather than a copy of all rows at once, so it does not occupy much memory. In each iteration, it yields a tuple of the row's index label and the row's contents as a Series. Here is an example that uses iterrows() to calculate total marks.
total_marks = 0
for index, row in df.iterrows():
    total_marks += row['Marks']
print(total_marks)
In the above code, we first define the total_marks variable to store the running sum. Then we create a for loop that retrieves both the index and the row content in each iteration, and add the value of the Marks column to total_marks. Lastly, we display the total marks. Please note, you need to unpack both the index and the row content in the for loop, even if you do not use the index later. Otherwise, the loop variable holds the whole (index, row) tuple, and looking up a column name on it raises an error. Here is an example to show it.
total_marks = 0
for row in df.iterrows():
    # row is an (index, Series) tuple here, so the next line fails
    total_marks += row['Marks']
print(total_marks)
Here is the error message.
ERROR!
Traceback (most recent call last):
File "<main.py>", line 8, in <module>
TypeError: tuple indices must be integers or slices, not str
Although iterrows() is easy to use and returns an iterator, it is slow for large dataframes because it builds a new Series object for every row. It is suitable for small dataframes.
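To see why the two-variable unpacking is needed, it can help to inspect a single item produced by iterrows(). The following is a minimal sketch that rebuilds the sample dataframe; the names first_item and the use of _ for the unused index are illustrative choices, not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'Name': ['John', 'Jim', 'Joe'],
                   'Marks': [92, 95, 94]})

# Each item yielded by iterrows() is a 2-tuple: (index label, row as a Series)
first_item = next(df.iterrows())
print(type(first_item))      # <class 'tuple'>
print(first_item[0])         # 0
print(type(first_item[1]))   # the pandas Series class

# Use "_" as the name for the index when it is not needed
total_marks = 0
for _, row in df.iterrows():
    total_marks += row['Marks']
print(total_marks)           # 281
```

Naming the unused index _ is only a convention; the tuple is still unpacked into two variables, which is what avoids the TypeError shown above.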
2. Using itertuples()
For large dataframes, you can use the itertuples() function. It is faster than iterrows() since it returns each row as a named tuple instead of a Series, and named tuples are lighter and faster to create than Series. Also, each iteration yields a single object rather than an (index, row) pair as with iterrows(); the index label is still available as the first field of the tuple, named Index. It is also available by default for each dataframe. Here is how to use it.
total_marks = 0
for row in df.itertuples():
    total_marks += row.Marks
print(total_marks)
In the above code, we loop through the result of itertuples(). In each iteration, we add the Marks attribute of the row to total_marks. Please note, in this case, we retrieve a single row object, instead of the index and row as in iterrows(). Also, we need to refer to columns as attributes using dot (.) notation. For example, using row['Marks'] will give an error, because a named tuple cannot be indexed by column name.
total_marks = 0
for row in df.itertuples():
    # a named tuple accepts integer positions or attributes, not column names
    total_marks += row['Marks']
print(total_marks)
Here is the error.
ERROR!
Traceback (most recent call last):
File "<main.py>", line 8, in <module>
TypeError: tuple indices must be integers or slices, not str
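As the error message suggests, a named tuple can still be indexed by integer position. The sketch below shows three ways to read values from the rows returned by itertuples(): by attribute, by position, and by a column name held in a variable through Python's built-in getattr(). It assumes the same sample dataframe; the variable names column and marks are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'Name': ['John', 'Jim', 'Joe'],
                   'Marks': [92, 95, 94]})

# By default, the first field of each named tuple is the index label
for row in df.itertuples():
    print(row.Index, row.Name, row.Marks)

# index=False drops the index field; fields can then be read by position
total_marks = 0
for row in df.itertuples(index=False):
    total_marks += row[2]        # position 2 is the Marks column
print(total_marks)               # 281

# getattr() helps when the column name is stored in a variable
column = 'Marks'
marks = [getattr(row, column) for row in df.itertuples(index=False)]
print(marks)                     # [92, 95, 94]
```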
3. Using apply()
The apply() function is used to apply a function to every row or column of a dataframe. It is very useful for performing row-wise or column-wise operations without writing an explicit loop. Please note that with axis=1 it still calls your function once for each row, so it is not a truly vectorized operation; it is usually faster than iterrows(), but built-in column operations are faster still when they are available. Here is a simple example to add 2 marks to each student's marks using apply().
def add_marks(row):
    return row['Marks'] + 2

# Apply function row-wise
result = df.apply(add_marks, axis=1)
for res in result:
    print(res)
In the above code, we first define an add_marks() function that accepts a row as input and returns the value of its Marks column plus 2.
Then we call the apply() function on our dataframe with two arguments: the add_marks function object and the axis argument. We set axis to 1 to apply the function row-wise. We store the result in a variable and loop through it to display the updated marks. Here is the output.
94
97
96
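For a simple calculation like this one, the same result can be obtained without calling a function for every row. The sketch below compares the row-wise apply() call with plain column arithmetic on the sample dataframe; the variable names applied and vectorized are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'Name': ['John', 'Jim', 'Joe'],
                   'Marks': [92, 95, 94]})

# Row-wise apply(): add_marks is called once per row
def add_marks(row):
    return row['Marks'] + 2

applied = df.apply(add_marks, axis=1)

# Column arithmetic: the addition is performed on the whole column at once
vectorized = df['Marks'] + 2

print(vectorized.tolist())                      # [94, 97, 96]
print(applied.tolist() == vectorized.tolist())  # True
```

apply() remains useful when the per-row logic cannot be expressed as column operations, for example when it involves branching on several columns.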
4. Using indexes
You can also use index-based accessors such as iloc[] or loc[]. iloc[] allows you to access rows by integer position and loc[] allows you to access rows by index label. This approach is slower than the other methods but allows you to precisely access and modify the row you want. Here is a simple example to loop through the rows of a dataframe and display each row's values.
for i in range(len(df)):
    print(f"Row {i}: {df.iloc[i]}")
Here is the output.
Row 0: ID          1
Name     John
Marks      92
Name: 0, dtype: object
Row 1: ID          2
Name      Jim
Marks      95
Name: 1, dtype: object
Row 2: ID          3
Name      Joe
Marks      94
Name: 2, dtype: object
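Since the main advantage of this approach is being able to modify specific rows, here is a minimal sketch of that. It reads a single cell with iloc[] and writes it back with at[], a related label-based accessor for single values. The rule of adding 1 mark to scores below 95 is invented purely for illustration, and the work is done on a copy named updated so the original dataframe stays unchanged.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'Name': ['John', 'Jim', 'Joe'],
                   'Marks': [92, 95, 94]})

updated = df.copy()   # work on a copy so the original dataframe is unchanged

for i in range(len(updated)):
    # iloc[row position, column position] reads a single cell
    marks = updated.iloc[i, 2]
    if marks < 95:
        # at[index label, column name] writes a single cell
        updated.at[updated.index[i], 'Marks'] = marks + 1

print(updated['Marks'].tolist())   # [93, 95, 95]
print(df['Marks'].tolist())        # [92, 95, 94]
```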
5. Using List Comprehension
List comprehension is one of the fastest ways to iterate over the values of a dataframe. It is easy to understand, versatile and faster than most of the other looping solutions, because it iterates over the column values directly instead of building an object for each row. Here is a simple example that iterates over every value in the Marks column, stores it in the result list, and then calls the sum() function on this list.
result = [x for x in df['Marks']]
print(sum(result))
Here is the output.
281
You can easily modify the above code to iterate over multiple columns and in each iteration, store the column values in a tuple and add it to result list.
result = [(x,y) for x,y in zip(df['Name'],df['Marks'])]
print(result)
Here is the output.
[('John', 92), ('Jim', 95), ('Joe', 94)]
In the above code, we use separate loop variables for each column. You can also use a single variable to fetch the entire row as a tuple. Here is an example to illustrate it. Here we use positions 0, 1, ... within the tuple to access the different column values for each row.
result = [(row[0],row[1]) for row in zip(df['Name'],df['Marks'])]
print(result)
Here is its result.
[('John', 92), ('Jim', 95), ('Joe', 94)]
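A list comprehension can also filter rows while iterating, by adding an if clause. The sketch below collects the names of students whose marks exceed a cut-off; the threshold value of 93 and the variable name top_students are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'Name': ['John', 'Jim', 'Joe'],
                   'Marks': [92, 95, 94]})

threshold = 93   # illustrative cut-off, not taken from the article

# Keep only the names of students whose marks exceed the threshold
top_students = [name
                for name, marks in zip(df['Name'], df['Marks'])
                if marks > threshold]
print(top_students)   # ['Jim', 'Joe']
```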
Conclusion
In this article, we have learnt several simple ways to iterate over rows in a pandas dataframe. If you are a beginner working with small datasets, use iterrows(). If you are working with large dataframes, use itertuples(). If you are reasonably experienced, you can use list comprehensions over columns, which are faster than iterrows() or itertuples(), or the apply() function, which saves you from writing the loop yourself. List comprehensions are also versatile and work well with large dataframes.
FAQ
1. Which is the fastest method to iterate over Pandas dataframe?
Among the methods that loop over whole rows, itertuples() is one of the fastest, and it works well even for large dataframes. If a vectorized operation is available for your task, such as calling sum() directly on a column, it will be even faster in many use cases.
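Actual speed depends on the data and the pandas version, so it is worth measuring on your own machine. The following is a rough timing sketch using Python's built-in timeit module on a synthetic dataframe of 5,000 rows; the dataframe big, the row count and the helper function names are all assumptions made for this illustration, and no particular timings are implied.

```python
import timeit
import pandas as pd

# A larger, synthetic dataframe so that timing differences become visible
n = 5_000
big = pd.DataFrame({'ID': range(n), 'Marks': [i % 100 for i in range(n)]})

def with_iterrows():
    return sum(row['Marks'] for _, row in big.iterrows())

def with_itertuples():
    return sum(row.Marks for row in big.itertuples())

def with_apply():
    return big.apply(lambda row: row['Marks'], axis=1).sum()

def with_comprehension():
    return sum([x for x in big['Marks']])

def with_vectorized():
    return big['Marks'].sum()

methods = (with_iterrows, with_itertuples, with_apply,
           with_comprehension, with_vectorized)

# Time 3 runs of each method and print the total seconds taken
for func in methods:
    seconds = timeit.timeit(func, number=3)
    print(f"{func.__name__}: {seconds:.4f} s")
```

All five functions compute the same total, so the only thing that differs between them is how the rows are visited.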
Sreeram Sreenivasan is the Founder of Ubiq. He has helped many Fortune 500 companies in the areas of BI & software development.