These days almost every website and application needs to work with files. As the number of users grows, these files quickly grow in size. Python developers often need to read large files while building their applications. There are several ways to read files in Python, but if you are not careful, you may load the entire file into your system's memory and bring it down. In this article, we will learn how to read large files in Python.
The Problem
Typically, when Python (or any other programming language) reads a file, it loads the whole file into memory. If the file is very large, say several gigabytes, this can overwhelm the memory and crash the system or server where it is being processed. That is why we need to ensure that only the bare minimum of the file is actually loaded into memory at any time. Let us see how to do this.
How to Read Large Files in Python
Files are often read in Python using the readlines() method, which loads the entire file into memory. It returns a list where each element is a line of the file, and this whole list is stored in memory. If your file is large, then by the time it has been fully read, it will occupy a large amount of memory. Also, every line that is read gets appended to the list, so it is time-consuming as well. Here is a sample code to read the file /home/test.txt using readlines().
f = open('/home/test.txt', 'r')
lines = f.readlines()  # loads the entire file into a list of lines
f.close()
The above code returns a list consisting of the target file's lines. To avoid loading everything at once, we need to use the file object as an iterator. Before we proceed, let us learn what an iterator is.
What is an Iterator?
An iterator is simply an object with a countable number of values. You can iterate over these values sequentially, using a loop. It keeps returning one value at a time until it runs out of values. An iterator works like a pointer to the values, so it does not load all of them into memory. As a result, it requires little storage and is memory efficient. Every file object is an iterable in Python, so if you want to read a large file, it is better to use the file object as an iterator.
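Here is a minimal sketch, assuming a file at /home/test.txt, that shows a file object behaving as an iterator: each call to next() pulls only one line into memory.

with open('/home/test.txt') as f:
    print(next(f))  # first line only; the rest of the file stays on disk
    print(next(f))  # second line; next() raises StopIteration at end of file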
Let us look at some of the simple ways to easily read a large file in Python using iterators.
1. Using the Iterator of a File Object
In this case, we use the file object itself as an iterator. For this purpose, we use the open() function, which opens the file and returns a file object for it. Thereafter, we loop over the file object to iterate through the file and print each line. You can customize the following code to do other things instead of printing the line.
with open("/home/test.txt") as file:
    for line in file:
        print(line)
Once the with block ends, the file is automatically closed. All this happens internally and is not explicitly visible.
This is one of the fastest ways to read a large file in Python. In fact, you can use the same method for all files, big or small.
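As a quick illustration of customizing the loop body, here is a minimal sketch (the file path and the search word 'error' are just placeholders) that counts matching lines without ever holding the whole file in memory.

count = 0
with open('/home/test.txt') as file:
    for line in file:
        if 'error' in line.lower():  # replace with your own per-line logic
            count += 1
print(count)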
2. Using fileinput
fileinput is a Python module that helps you easily loop over a list of files, or over standard input. It provides an input() function for this purpose, which returns an iterator over the lines of the file(s). Here is a sample code to illustrate its use.
import fileinput

path = '/home/test.txt'
for line in fileinput.input([path]):
    print(line)
Note that in the above code we pass a list of file paths to the input() function. This makes it easy to loop over several files one after another with the same loop.
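For example, here is a small sketch, assuming two hypothetical files /home/test1.txt and /home/test2.txt, that reads both in a single loop.

import fileinput

paths = ['/home/test1.txt', '/home/test2.txt']
for line in fileinput.input(paths):
    # fileinput.filename() tells you which file the current line came from
    print(fileinput.filename(), line, end='')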
3. Using read()
Both the above methods rely on reading individual lines from a file sequentially. But what if your file does not contain separate lines and is instead a continuous stream of characters with no newline characters? In this case, you will need to read the file content in chunks.
path = '/home/test.txt'

with open(path) as f:
    while True:
        c = f.read(1024)
        if not c:
            break
        print(c, end='')
In the above code, we have used a chunk size of 1024 characters (in text mode, read() counts characters rather than bytes). You can adjust it as per your requirements. We open the file using the open() function and then run a while loop. In each iteration, we use the read() function to read a chunk of the file's data. This goes on as long as there is unread data in the file. Once all the data has been read, read() returns an empty string, the break statement executes, and the loop ends.
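If you prefer the iterator style used elsewhere in this article, the same chunked read can be written with the two-argument form of iter(), which keeps calling a function until it returns the sentinel value. This is just an equivalent sketch of the loop above, with the same placeholder path and chunk size.

from functools import partial

path = '/home/test.txt'
with open(path) as f:
    # iter(callable, sentinel) calls f.read(1024) until it returns '' (end of file)
    for chunk in iter(partial(f.read, 1024), ''):
        print(chunk, end='')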
4. Using Python Pandas
Pandas is a popular Python library that allows you to easily read and process data as dataframes, which consist of rows and columns. It provides the read_csv() function to read text as well as CSV files, big or small. Here is a sample code to read a large file in chunks using Python Pandas.
import pandas as pd

path = '/home/test.txt'
chunk_size = 1000  # no. of rows per chunk

for chunk in pd.read_csv(path, chunksize=chunk_size):
    for index, row in chunk.iterrows():
        print(row)
In Pandas >= 1.2, read_csv() with chunksize can also be used as a context manager:
import pandas as pd

path = '/home/test.txt'
chunk_size = 1000  # no. of rows per chunk

with pd.read_csv(path, chunksize=chunk_size) as reader:
    for chunk in reader:
        process(chunk)  # replace process() with your own chunk-handling function
In the above code, we read 1000 rows of data in each iteration, and each chunk is a Pandas dataframe. You can change the chunk size as per your requirement.
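In practice you usually aggregate something across chunks rather than print rows. Here is a minimal sketch, assuming the file has a numeric column named 'amount' (a hypothetical column name), that sums it without loading the whole file at once.

import pandas as pd

path = '/home/test.txt'
total = 0
for chunk in pd.read_csv(path, chunksize=1000):
    total += chunk['amount'].sum()  # 'amount' is a hypothetical column name
print(total)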
5. Using Dask
Dask is a Python library for parallel processing. Dask DataFrame allows you to process a Pandas-like dataframe in a parallel manner. This is super useful for really large files (>100 GB).
Dask DataFrame also provides a read_csv() function that allows you to read large files, just like Pandas. Here is a code sample to demonstrate it. Unlike Pandas, Dask handles the chunking (partitioning) internally on its own, so you do not need to explicitly specify a chunk size.
import dask.dataframe as dd

path = '/home/test.txt'
df = dd.read_csv(path)

for idx, row in df.iterrows():
    print(row)
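Iterating row by row gives up most of Dask's parallelism, so in practice you usually describe an aggregation on the Dask dataframe and call compute() at the end. Here is a minimal sketch, again assuming a hypothetical numeric column named 'amount'.

import dask.dataframe as dd

path = '/home/test.txt'
df = dd.read_csv(path)

# Nothing is read yet; compute() triggers the parallel read and aggregation
total = df['amount'].sum().compute()  # 'amount' is a hypothetical column name
print(total)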
Conclusion
In this article, we have learnt several ways to read a large file in Python. The key point is to iterate over the file object so that the entire file is not loaded into memory all at once. A file iterator is only a pointer into the file and allows you to read it in manageable lines or chunks without overwhelming your system's memory. All the methods listed above use this same concept. You can use any of them as per your requirement.
FAQ
1. Can I use the above solution for any file type?
Yes. You can use the above solutions for any file type. The first two solutions read files line by line, so they work especially well for text and CSV files. Solution #3 works for pretty much all files, whether their content is organized as lines or not (see the sketch below for binary files).
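For binary files, the chunked approach from solution #3 works if you open the file in binary mode, as in this small sketch (the path is a placeholder).

size = 0
path = '/home/test.bin'  # hypothetical binary file
with open(path, 'rb') as f:
    while True:
        chunk = f.read(4096)  # read 4096 bytes at a time
        if not chunk:
            break
        size += len(chunk)  # replace with your own byte-level processing
print(size)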
2. Is there a file limit for which these solutions work?
No. Each solution is scalable and works for large files. But if you have really large files (>100 GB), then you may want to try solution #5, since it employs parallel processing.
3. Can they be used in another Python script?
Yes. You can easily include them in your Python script, website or application.