Feature Type
- [x] Adding new functionality to pandas
- [ ] Changing existing functionality in pandas
- [ ] Removing existing functionality in pandas
Problem Description
Currently, pandas has a separate IO method for each file format (to_csv, read_parquet, etc.). This requires users to:
- Remember multiple method names
- Change code when switching formats
Feature Description
A unified save/read API would simplify common IO operations while maintaining explicit control when needed:
- The file type is inferred from the filepath extension, but a format arg can be passed to be explicit. An error is raised in some cases where the inferred file type disagrees with the passed file type.
- Both methods accept **kwargs and pass them along to the underlying file-type-specific pandas IO methods.
- Optionally, support some basic translation across discrepancies in arg names in existing IO methods (e.g. "usecols" in read_csv vs "columns" in read_parquet).
# Simplest happy path:
df.save('data.csv') # Uses to_csv
df = pd.read('data.parquet') # Uses read_parquet
# Optionally, be explicit about expected file type
df.save('data.csv', format="csv") # Uses to_csv
df = pd.read('data.parquet', format="parquet") # Uses read_parquet
# Raises ValueError for conflicting format info:
df.save('data.csv', format='parquet') # Conflicting types
df.save('data.txt', format='csv') # .txt implies text format
# Reading allows overrides for misnamed files (or should we require users to rename their files properly first?)
df = pd.read('mislabeled.txt', format='parquet')
# Not sure if we should allow save when inferred file type is not a standard type:
df.save('data', format='csv') # No extension, needs type
df.save('mydata.unknown', format='csv') # Unclear extension
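The inference and conflict rules above could be sketched roughly as follows. This is only an illustration, not part of pandas: `resolve_format`, `read`, and the `_EXT_TO_FORMAT` table are hypothetical names, and the sketch settles the open questions in one arbitrary way (an unrecognized or missing extension is accepted whenever a format is passed explicitly, so it does not reproduce the `.txt` case above).

```python
from pathlib import Path

# Hypothetical extension-to-format table; a real implementation would need more entries.
_EXT_TO_FORMAT = {
    ".csv": "csv",
    ".parquet": "parquet",
    ".feather": "feather",
    ".json": "json",
}


def resolve_format(filepath, format=None):
    """Return the format name to use, or raise ValueError on missing/conflicting info."""
    inferred = _EXT_TO_FORMAT.get(Path(filepath).suffix.lower())
    if format is None:
        if inferred is None:
            raise ValueError(f"Cannot infer a format from {filepath!r}; pass format=...")
        return inferred
    if format not in _EXT_TO_FORMAT.values():
        raise ValueError(f"Unsupported format: {format!r}")
    if inferred is not None and inferred != format:
        raise ValueError(
            f"Extension of {filepath!r} implies {inferred!r}, but format={format!r} was passed"
        )
    # Assumption: an unknown or missing extension is fine when format is explicit.
    return format


def read(filepath, format=None, **kwargs):
    """Hypothetical unified reader: dispatch to pd.read_<format> and forward kwargs."""
    import pandas as pd  # imported lazily so resolve_format has no dependencies

    reader = getattr(pd, f"read_{resolve_format(filepath, format)}")
    return reader(filepath, **kwargs)
```

With this sketch, `resolve_format('data.csv')` gives `'csv'`, `resolve_format('data.csv', format='parquet')` raises ValueError, and `resolve_format('mydata.unknown', format='csv')` gives `'csv'`. A matching save would look up `DataFrame.to_<format>` in the same way.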
Alternative Solutions
Existing functionality is OK, just not the simplest to use.
Additional Context
No response
Comment From: zkurtz
My workaround for now is to use dummio, specifically dummio.pandas.df_io, which serves as a draft implementation if there's interest to include this type of thing directly in pandas.
from dummio.pandas import df_io
df = df_io.load("data.parquet")
df_io.save(df, filepath="data.feather")
... etc
Comment From: tsafacjo
Can I work on it?
Comment From: rhshadrach
Thanks for the request. I'm negative on this feature: it adds more code to maintain and test without providing any new behavior that users cannot already easily access.
In addition, it requires pandas to parse paths and determine the extension, inferring what format to use. Also, using kwargs in a signature prevents tools (linters, IDEs, notebooks) from determining the proper arguments and docs for the user. Both of these (inference and kwargs) I would like to see less of in pandas, not more.
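As a small illustration of the kwargs point (not taken from pandas itself), the sketch below compares what signature introspection reports for an explicit function and for a wrapper that forwards **kwargs. Both functions are hypothetical stand-ins. Editors and linters rely on similar signature information.

```python
import inspect


def read_explicit(filepath, sep=",", usecols=None):
    """Hypothetical reader with an explicit signature."""
    return (filepath, sep, usecols)


def read_generic(filepath, format=None, **kwargs):
    """Hypothetical unified reader that forwards **kwargs unchanged."""
    return read_explicit(filepath, **kwargs)


explicit_params = list(inspect.signature(read_explicit).parameters)
generic_params = list(inspect.signature(read_generic).parameters)

print(explicit_params)  # ['filepath', 'sep', 'usecols']
print(generic_params)   # ['filepath', 'format', 'kwargs']
```

The wrapper's signature shows only an opaque `kwargs`, so a misspelled argument such as `usecol=...` would surface only as a TypeError at call time.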