Feature Type
- [x] Adding new functionality to pandas
- [ ] Changing existing functionality in pandas
- [ ] Removing existing functionality in pandas
Problem Description
I have a file with 20 GB of data that I need to process. When I use a pandas DataFrame, the full 20 GB needs to be loaded, which makes the computer slow or even crashes it. Could this process be made more efficient by automatically (it is very important that the user does not have to do anything here) loading a chunk, processing it, writing it out, loading the second chunk, and so on?
This sort of thing is possible; it is done by ROOT, for instance.
Feature Description
This would just work with the normal DataFrames; there could be an option like
pd.chunk_size = 100
which would process 100 MB at a time, so that no more than 100 MB would be in memory at once.
Alternative Solutions
Alternatively, we can do:
import ROOT
rdf = ROOT.RDataFrame('tree', 'path_to_file.root')
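For context, RDataFrame avoids the memory problem because computations are declared lazily and then run in a single event loop over the file. A minimal sketch (the tree name, file path, column name, and cut are made up):

import ROOT

rdf = ROOT.RDataFrame('tree', 'path_to_file.root')
# Filter and Histo1D only declare the computation; nothing is read yet
h = rdf.Filter('pt > 20').Histo1D('pt')
# The event loop runs here, streaming over the file entry by entry,
# so the whole dataset never has to sit in memory
print(h.GetValue().GetMean())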
Additional Context
No response
Comment From: snitish
@acampove what is the format of your data? If it is CSV, you can use the chunksize argument of read_csv. See https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-chunking
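For concreteness, a minimal sketch of that approach; the file names, chunk size, and processing step are placeholders:

import pandas as pd

# chunksize makes read_csv return an iterator of DataFrames,
# so only one chunk needs to be in memory at a time
reader = pd.read_csv('data.csv', chunksize=100_000)
for chunk in reader:
    result = chunk.describe()              # placeholder for the real processing
    result.to_csv('output.csv', mode='a')  # append each processed chunk to the output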
Comment From: acampove
Hi,
Sorry for the delayed reply. The issue was, from what I remember, that loading data from ROOT dataframes into memory and then from there into pandas, like:
# Load data in memory from ROOT dataframes
data = rdf.AsNumpy()
# Put data in pandas dataframe
df = pd.DataFrame(data)
was taking too much memory. However, that is not a problem with pandas. We likely need to do something like:
l_rdf = _split_dataframes_into_chunks(rdf=rdf)
for rdf in l_rdf:
    _process(rdf=rdf)
where _process can do the translation to pandas and do whatever processing needs to be done. In practice, something like:
df = pd.from_root(rdf, nrows=100000)
that loads the data in chunks might help. However, ROOT dataframes tend to have thousands of columns, of which only 20-30 are used. So implementing this in pandas itself might not be ideal, and a preprocessing step to trim those columns might also be needed.
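Until something like that exists, here is a rough sketch of the chunked translation, assuming RDataFrame's Range and the columns argument of AsNumpy behave as documented (the column names, chunk size, and _process step are placeholders):

import ROOT
import pandas as pd

columns = ['pt', 'eta', 'phi']   # only the 20-30 branches actually needed
chunk_size = 100_000             # entries per chunk, tuned to the available memory

rdf = ROOT.RDataFrame('tree', 'path_to_file.root')
n_entries = rdf.Count().GetValue()

for start in range(0, n_entries, chunk_size):
    # Range restricts the event loop to a slice of entries (single-threaded only)
    chunk = rdf.Range(start, min(start + chunk_size, n_entries))
    # Passing an explicit column list to AsNumpy avoids loading the unused branches
    data = chunk.AsNumpy(columns=columns)
    df = pd.DataFrame(data)
    _process(df=df)              # placeholder for the actual processing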