Feature Type

  • [x] Adding new functionality to pandas

  • [ ] Changing existing functionality in pandas

  • [ ] Removing existing functionality in pandas

Problem Description

Hi,

When doing

df.attrs['array'] = array

df.to_parquet('file.parquet')

I see that the array is not saved. Could this be implemented?

Cheers

Feature Description

The code above would save the array, and the array would be restored when the Parquet file is loaded into a DataFrame.

Alternative Solutions

I guess I could do it myself separately with a helper function.
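
A rough sketch of what such a helper could look like, assuming the array exposes a tolist() method (as NumPy arrays do) and assuming the attrs have to be JSON-serializable before they can be stored as file metadata. The name attrs_to_jsonable is hypothetical, and the stdlib array type is used only as a stand-in so the snippet runs without extra dependencies.

```python
import json
from array import array  # stdlib stand-in; a NumPy array also exposes .tolist()


def attrs_to_jsonable(attrs):
    """Return a copy of attrs where array-like values are plain lists."""
    out = {}
    for key, value in attrs.items():
        # Anything with .tolist() (NumPy arrays, stdlib arrays) becomes a list
        out[key] = value.tolist() if hasattr(value, "tolist") else value
    return out


attrs = {"array": array("d", [1.0, 2.0, 3.0]), "note": "calibration"}
safe_attrs = attrs_to_jsonable(attrs)

# json.dumps raises TypeError on values it cannot encode, so this doubles as a check
encoded = json.dumps(safe_attrs)
restored = json.loads(encoded)
```

With pandas, the idea would be to assign df.attrs = attrs_to_jsonable(df.attrs) before calling df.to_parquet('file.parquet'), and to convert the list back to an array after loading.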

Additional Context

No response

Comment From: imramraja

Hi! I’d like to work on this issue as my first contribution to pandas. Please assign it to me.

I’ve already started exploring the codebase and implemented a prototype that stores DataFrame.attrs in Parquet file metadata using pyarrow. I plan to support restoring it in read_parquet() as well. Looking forward to your feedback and guidance!

Comment From: acampove

I'm not a pandas maintainer, but you might also want to implement it for other formats, since one can save to JSON, CSV, etc. Saving the extra attributes to Parquet should not be hard. However, I am not sure there is an easy, maintainable way to put them in the other formats without breaking anything. The way I see it, the attrs are metadata, so I would add a metadata field to the JSON file. For CSV, I do not know how it could be done.

Comment From: arthurlw

xref #54321

Hi, thanks for raising this! Saving .attrs to Parquet files is already supported in pandas 2.1.0 and above (see the issue linked above).

Closing this for now, but feel free to open another issue if you still encounter issues!