Code Sample, a copy-pastable example if possible
Please consider merging
https://github.com/pandas-dev/pandas/compare/master...JacekPliszka:master
Problem description
Currently pandas cannot add custom metadata to a parquet file.
This patch adds a metadata argument to DataFrame.to_parquet that allows this. A warning is issued when the pandas key is present in the passed dictionary.
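For illustration, a sketch of how the proposed keyword would be used (note: the metadata argument below comes from the linked branch and was never merged into pandas, so treat it as hypothetical):

import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
# custom file-level key-value metadata; a warning would be raised if the
# dict contained a "pandas" key, since that overwrites pandas' own metadata
df.to_parquet("data.parquet", metadata={"stage": "2", "algorithm": "v1.3"})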
Comment From: TomAugspurger
cc @cpcloud
What's the purpose here? Would this be in addition to, or in place of, the usual pandas_metadata?
Comment From: JacekPliszka
The user-given dictionary updates the current key-value file metadata. If the user passes a pandas key, it overwrites the pandas_metadata, but a warning is issued via warnings.warn.
Purpose:
User metadata is badly needed when:
- processing is done in several stages and you want to keep information about the version/algorithm used at each stage so you can debug it later
- processing is done with different parameters and you want to keep the parameters used with the file
- you need to add extra custom information, e.g. sometimes one column comes from one source and sometimes it is calculated from other columns, and you want to keep this information and pass it on to later stages of processing
- you have certain very high-level aggregates that are costly to compute and you do not want to create columns for them
For me it is a very important feature and one of the main reasons I want to switch to parquet.
Comment From: TomAugspurger
That all sounds reasonable.
Comment From: JacekPliszka
Slight cosmetic change: made the code a bit more Pythonic.
Comment From: JacekPliszka
Also added a whatsnew entry and rebased onto current master.
Comment From: jorisvandenbossche
Note for readers: the PR was closed but mentions a work-around that can be used for now if you need this: https://github.com/pandas-dev/pandas/pull/20534#issuecomment-453236538
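The linked comment is not reproduced here, but a typical pyarrow-based workaround (a sketch, assuming you write the file through pyarrow yourself; the key names are illustrative) looks like this:

import json
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"a": [1, 2]})
table = pa.Table.from_pandas(df)
# merge custom keys into the existing schema metadata, which already
# holds the b"pandas" entry that pyarrow creates
custom = {b"myapp_version": b"1.3", b"params": json.dumps({"alpha": 0.1}).encode()}
table = table.replace_schema_metadata({**(table.schema.metadata or {}), **custom})
pq.write_table(table, "data.parquet")

# read the custom metadata back
meta = pq.read_schema("data.parquet").metadata
print(meta[b"myapp_version"])  # b'1.3'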
Comment From: snowman2
I have been thinking about this and am wondering what the general thoughts are on using DataFrame.attrs and Series.attrs for reading and writing metadata to/from parquet?
For example, here is how the metadata would be written:
import pandas

pdf = pandas.DataFrame({"a": [1]})
pdf.attrs = {"name": "my custom dataset"}
pdf.a.attrs = {"long_name": "Description about data", "nodata": -1, "units": "metre"}
pdf.to_parquet("file.parquet")
Then, when loading in the data:
pdf = pandas.read_parquet("file.parquet")
pdf.attrs
{"name": "my custom dataset"}
pdf.a.attrs
{"long_name": "Description about data", "nodata": -1, "units": "metre"}
Is this something that would need to be done in pandas or pyarrow/fastparquet?
EDIT: Added issue to pyarrow here
Comment From: snowman2
Here is a hack to get the attrs to work with pyarrow:
import json
import pandas
import pyarrow
import pyarrow.parquet

def _write_attrs(table, pdf):
    # merge DataFrame-level and per-column attrs into the "pandas"
    # schema metadata entry
    schema_metadata = table.schema.metadata or {}
    pandas_metadata = json.loads(schema_metadata.get(b"pandas", "{}"))
    column_attrs = {}
    for col in pdf.columns:
        attrs = pdf[col].attrs
        if not attrs or not isinstance(col, str):
            continue
        column_attrs[col] = attrs
    pandas_metadata.update(
        attrs=pdf.attrs,
        column_attrs=column_attrs,
    )
    schema_metadata[b"pandas"] = json.dumps(pandas_metadata)
    return table.replace_schema_metadata(schema_metadata)

def _read_attrs(table, pdf):
    # restore DataFrame-level and per-column attrs from the "pandas"
    # schema metadata entry
    schema_metadata = table.schema.metadata or {}
    pandas_metadata = json.loads(schema_metadata.get(b"pandas", "{}"))
    pdf.attrs = pandas_metadata.get("attrs", {})
    col_attrs = pandas_metadata.get("column_attrs", {})
    for col in pdf.columns:
        pdf[col].attrs = col_attrs.get(col, {})

def to_parquet(pdf, filename):
    # write parquet file with attributes
    table = pyarrow.Table.from_pandas(pdf)
    table = _write_attrs(table, pdf)
    pyarrow.parquet.write_table(table, filename)

def read_parquet(filename):
    # read parquet file with attributes
    table = pyarrow.parquet.read_pandas(filename)
    pdf = table.to_pandas()
    _read_attrs(table, pdf)
    return pdf
Example:
Writing:
pdf = pandas.DataFrame({"a": [1]})
pdf.attrs = {"name": "my custom dataset"}
pdf.a.attrs = {"long_name": "Description about data", "nodata": -1, "units": "metre"}
to_parquet(pdf, "a.parquet")
Reading:
pdf = read_parquet("a.parquet")
pdf.attrs
{"name": "my custom dataset"}
pdf.a.attrs
{"long_name": "Description about data", "nodata": -1, "units": "metre"}
Comment From: snowman2
I have a PR that seems to do the trick: #41545
Comment From: jorisvandenbossche
Is this something that would need to be done in pandas or pyarrow/fastparquet?
Ideally, I think this would actually be done in pyarrow/fastparquet, as those libraries are where the "pandas" metadata item currently gets constructed.
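To make that concrete, here is a small check (assuming pyarrow) showing that the "pandas" metadata item is created by pyarrow itself when converting a DataFrame:

import json
import pandas as pd
import pyarrow as pa

table = pa.Table.from_pandas(pd.DataFrame({"a": [1]}))
# pyarrow stores pandas-specific information as JSON under the b"pandas" key
pandas_meta = json.loads(table.schema.metadata[b"pandas"])
print(pandas_meta["columns"][0]["name"])  # "a"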
Comment From: arogozhnikov
so... can we have something simple to work with df.attrs?
The goal is to replace multiple pseudo-CSV formats, which add #-prefixed comments at the beginning of a file, with something systematic.
I believe everyone would agree that's 1) a common use case, 2) supportable by parquet, and 3) should work without hassle for the reader (I'm OK with hassle for the writer)
Comment From: jorisvandenbossche
I believe everyone would agree that's 1) a common use case, 2) supportable by parquet, and 3) should work without hassle for the reader (I'm OK with hassle for the writer)
Yes, and a contribution to add this functionality is welcome, I think. https://github.com/pandas-dev/pandas/pull/41545 tried to do this but was only closed because it also wanted to store column-level attrs (which was the main driver for the PR author), not because we don't want this in general. A PR focusing on storing/restoring DataFrame-level attrs is welcome.
And a PR to add generic parquet file-level metadata with a metadata keyword (as was attempted in #20534, the original purpose of this issue) is also still welcome, I think.
Comment From: davetapley
⚠️ Edit: this ⬇️ is not needed since 2.1.0.
~My workaround (assuming fastparquet)~:
import fastparquet

# write
df.to_parquet(path)
meta = {'foo': 'bar'}
fastparquet.update_file_custom_metadata(path, meta)

# read
pf = fastparquet.ParquetFile(path)
df_ = pf.to_pandas()
meta_ = pf.key_value_metadata
Note meta must be dict[str, str] (so no nested dicts without bring-your-own serialization).
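If you do need nested values, one option is to serialize them yourself; a minimal sketch using json (the "params" key and its contents are illustrative):

import json

# values must be strings, so encode nested structures as JSON
meta = {"params": json.dumps({"alpha": 0.1, "stages": [1, 2]})}
fastparquet.update_file_custom_metadata(path, meta)

# decode after reading
pf = fastparquet.ParquetFile(path)
params = json.loads(pf.key_value_metadata["params"])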
Comment From: davetapley
This is done and shipped in 2.1.0 🎉
- https://github.com/pandas-dev/pandas/pull/54346
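For reference, a minimal sketch of the round-trip this enables (assuming pandas >= 2.1 with the pyarrow engine; per the PR above, DataFrame.attrs are now preserved through to_parquet/read_parquet):

import pandas as pd

df = pd.DataFrame({"a": [1]})
df.attrs = {"name": "my custom dataset"}
df.to_parquet("file.parquet")

roundtripped = pd.read_parquet("file.parquet")
print(roundtripped.attrs)  # {'name': 'my custom dataset'}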