Feature Type
- [x] Adding new functionality to pandas
- [ ] Changing existing functionality in pandas
- [ ] Removing existing functionality in pandas
Problem Description
Hi,
When doing
df.attrs['array'] = array
df.to_parquet('file.parquet')
I see that the array is not saved with the file. Could this be implemented?
Cheers
Feature Description
The code above would save the array, and the array would be restored when loading the Parquet file into a DataFrame.
Alternative Solutions
I guess I could do it myself separately with a helper function.
Additional Context
No response
Comment From: imramraja
Hi! I’d like to work on this issue as my first contribution to pandas. Please assign it to me.
I’ve already started exploring the codebase and implemented a prototype that stores DataFrame.attrs in Parquet file metadata using pyarrow. I plan to support restoring it in read_parquet() as well.
Looking forward to your feedback and guidance!
Comment From: acampove
I'm not a pandas maintainer, but you might also want to implement it for other formats, since one can save to JSON, CSV, etc. Saving the extra attributes to Parquet should not be hard. However, I am not sure there is an easy, maintainable way to put them into the other formats without breaking anything. The way I see it, the attrs are metadata, so I would add a metadata field to the JSON file. For CSV, I do not know how it could be done.
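To make the JSON idea concrete, here is a minimal sketch of wrapping the frame in an outer object with a metadata field. The function names and the `metadata` and `frame` field names are hypothetical, and the attrs are assumed to be JSON-serializable. This is only one possible layout, not something pandas does today.

```python
import io
import json

import pandas as pd


def to_json_with_attrs(df: pd.DataFrame) -> str:
    """Serialize df and df.attrs into one JSON document."""
    # "metadata" and "frame" are hypothetical field names for this sketch.
    payload = {
        "metadata": df.attrs,
        "frame": json.loads(df.to_json(orient="split")),
    }
    return json.dumps(payload)


def from_json_with_attrs(text: str) -> pd.DataFrame:
    """Rebuild the DataFrame and reattach the attrs stored alongside it."""
    payload = json.loads(text)
    frame_json = json.dumps(payload["frame"])
    df = pd.read_json(io.StringIO(frame_json), orient="split")
    df.attrs = payload.get("metadata", {})
    return df


df = pd.DataFrame({"x": [1, 2, 3]})
df.attrs["units"] = "cm"

restored = from_json_with_attrs(to_json_with_attrs(df))
```

Note the trade-off: a file written this way can no longer be read with a plain `pd.read_json` call, which is the kind of breakage mentioned above.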
Comment From: arthurlw
xref #54321
Hi, thanks for raising this! Saving .attrs to Parquet files is already supported in pandas 2.1.0 and above (see the issue linked above).
Closing this for now, but feel free to open another issue if you still encounter issues!