Code Sample, a copy-pastable example if possible

Please consider merging

https://github.com/pandas-dev/pandas/compare/master...JacekPliszka:master

Problem description

Currently pandas cannot add custom metadata to a parquet file.

This patch adds a metadata argument to DataFrame.to_parquet that allows for that. A warning is issued when the pandas key is present in the dictionary passed.
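For illustration, a minimal sketch of the proposed usage (the metadata keyword exists only in the linked patch, not in released pandas at this point):

import pandas as pd

df = pd.DataFrame({"a": [1]})
# `metadata` keyword as proposed in the patch (hypothetical until merged):
df.to_parquet("out.parquet", metadata={"stage": "cleaning", "algo_version": "1.2"})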

Comment From: TomAugspurger

cc @cpcloud

What's the purpose here? Would this be in addition to or in place of the usual pandas_metadata?

Comment From: JacekPliszka

The user-given dictionary updates the current key-value file metadata. If the user passes a pandas key, it overwrites pandas_metadata, but a warning is issued via warnings.warn.
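A minimal sketch of that merge behavior (a hypothetical helper, not the actual patch code):

import warnings


def merge_file_metadata(existing, user_metadata):
    # User keys update the existing key-value file metadata; a "pandas" key
    # overwrites the pandas metadata but triggers a warning.
    if "pandas" in user_metadata:
        warnings.warn("user-supplied 'pandas' key overwrites the pandas metadata")
    merged = dict(existing)
    merged.update(user_metadata)
    return merged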

Purpose:

User metadata is really needed when (see the illustrative example after this list):

  1. processing is done in several stages and you want to keep information about version/algorithm used on each stage so you can debug it later

  2. processing is done with different parameters and you want to keep parameters used with the file

  3. you need to add extra custom information, e.g. sometimes a column comes from one source and sometimes it is calculated from other columns, and you want to keep this information and pass it on to later stages of processing

  4. you have certain high-level aggregates that are costly to compute and you do not want to create columns for them
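For illustration, the use cases above might translate into metadata like this (all keys and values made up):

metadata = {
    "pipeline_stage": "normalization",        # 1. version/algorithm per stage
    "algorithm_version": "2.3.1",
    "parameters": '{"threshold": 0.5}',       # 2. parameters used for this file
    "column_sources": '{"price": "feed_a"}',  # 3. provenance of a derived column
    "total_revenue": "1234567.89",            # 4. costly high-level aggregate
}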

For me it is a very important feature and one of the main reasons I want to switch to parquet.

Comment From: TomAugspurger

That all sounds reasonable.

Comment From: JacekPliszka

Slight cosmetic suggestion: made the code a bit more Pythonic.

Comment From: JacekPliszka

Also added a whatsnew entry and rebased onto the current master.

Comment From: jorisvandenbossche

Note for readers: the PR was closed but mentions a work-around that can be used for now if you need this: https://github.com/pandas-dev/pandas/pull/20534#issuecomment-453236538

Comment From: snowman2

I have been thinking about this and am wondering what the general thoughts are on using DataFrame.attrs and Series.attrs for reading and writing metadata to/from parquet.

For example, here is how the metadata would be written:

import pandas

pdf = pandas.DataFrame({"a": [1]})
pdf.attrs = {"name": "my custom dataset"}
pdf.a.attrs = {"long_name": "Description about data", "nodata": -1, "units": "metre"}
pdf.to_parquet("file.parquet")

Then, when loading in the data:

pdf = pandas.read_parquet("file.parquet")
pdf.attrs
# {"name": "my custom dataset"}
pdf.a.attrs
# {"long_name": "Description about data", "nodata": -1, "units": "metre"}

Is this something that would need to be done in pandas or pyarrow/fastparquet?

EDIT: Added issue to pyarrow here

Comment From: snowman2

Here is a hack to get the attrs to work with pyarrow:

import json

import pyarrow
import pyarrow.parquet


def _write_attrs(table, pdf):
    # Store DataFrame- and column-level attrs inside the "pandas" item of the
    # Arrow schema metadata.
    schema_metadata = table.schema.metadata or {}
    pandas_metadata = json.loads(schema_metadata.get(b"pandas", "{}"))
    column_attrs = {}
    for col in pdf.columns:
        attrs = pdf[col].attrs
        if not attrs or not isinstance(col, str):
            continue
        column_attrs[col] = attrs
    pandas_metadata.update(
        attrs=pdf.attrs,
        column_attrs=column_attrs,
    )
    schema_metadata[b"pandas"] = json.dumps(pandas_metadata)
    return table.replace_schema_metadata(schema_metadata)


def _read_attrs(table, pdf):
    # Restore DataFrame- and column-level attrs from the "pandas" item of the
    # Arrow schema metadata.
    schema_metadata = table.schema.metadata or {}
    pandas_metadata = json.loads(schema_metadata.get(b"pandas", "{}"))
    pdf.attrs = pandas_metadata.get("attrs", {})
    col_attrs = pandas_metadata.get("column_attrs", {})
    for col in pdf.columns:
        pdf[col].attrs = col_attrs.get(col, {})


def to_parquet(pdf, filename):
    # write parquet file with attributes
    table = pyarrow.Table.from_pandas(pdf)
    table = _write_attrs(table, pdf)
    pyarrow.parquet.write_table(table, filename)


def read_parquet(filename):
    # read parquet file with attributes
    table = pyarrow.parquet.read_pandas(filename)
    pdf = table.to_pandas()
    _read_attrs(table, pdf)
    return pdf

Example:

Writing:

pdf = pandas.DataFrame({"a": [1]})
pdf.attrs = {"name": "my custom dataset"}
pdf.a.attrs = {"long_name": "Description about data", "nodata": -1, "units": "metre"}
to_parquet(pdf, "a.parquet")

Reading:

pdf = read_parquet("a.parquet")
pdf.attrs
# {"name": "my custom dataset"}
pdf.a.attrs
# {"long_name": "Description about data", "nodata": -1, "units": "metre"}

Comment From: snowman2

I have a PR that seems to do the trick: #41545

Comment From: jorisvandenbossche

> Is this something that would need to be done in pandas or pyarrow/fastparquet?

Ideally, I think this would actually be done in pyarrow/fastparquet, as it is in those libraries that the "pandas" metadata item currently gets constructed.
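For illustration, this is where that metadata item lives today (standard pyarrow behavior):

import pandas as pd
import pyarrow as pa

# pyarrow attaches the "pandas" item to the Arrow schema metadata when
# converting from a DataFrame; this is the item the libraries construct.
table = pa.Table.from_pandas(pd.DataFrame({"a": [1]}))
print(table.schema.metadata[b"pandas"])  # JSON describing index, columns, dtypes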

Comment From: arogozhnikov

So... can we have something simple to work with df.attrs?

The goal is to replace the many pseudo-CSV formats that add #-prefixed comments at the beginning of a file with something systematic.

I believe everyone would agree that's 1) a common use case, 2) supportable by parquet, and 3) something that should work without hassle for the reader (I'm OK with hassle for the writer).
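For context, a sketch of the kind of file being replaced (hypothetical contents):

import io

import pandas as pd

# Pseudo-CSV: metadata lives in #-prefixed header comments, which read_csv
# can only skip, not parse into anything structured.
raw = "# units: metre\n# source: sensor_42\na,b\n1,2\n"
df = pd.read_csv(io.StringIO(raw), comment="#")  # header metadata is discarded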

Comment From: jorisvandenbossche

> I believe everyone would agree that's 1) a common use case, 2) supportable by parquet, and 3) something that should work without hassle for the reader (I'm OK with hassle for the writer).

Yes, and a contribution to add this functionality is welcome, I think. https://github.com/pandas-dev/pandas/pull/41545 tried to do this but was only closed because it also wanted to store column-level attrs (which was the main driver for the PR author), not because we don't want this in general. A PR focusing on storing/restoring DataFrame-level attrs is welcome.

And a PR to add generic parquet file-level metadata with a metadata keyword (as was attempted in #20534, and the original purpose of this issue) is also still welcome I think.

Comment From: davetapley

Edit: you don't need this ⬇️ since 2.1.0 ⚠️

~My workaround (assuming fastparquet)~:

import fastparquet

# write
df.to_parquet(path)
meta = {'foo': 'bar'}
fastparquet.update_file_custom_metadata(path, meta)

# read
pf = fastparquet.ParquetFile(path)
df_ = pf.to_pandas()
meta_ = pf.key_value_metadata

Note meta must be a dict[str, str] (so no nested dicts without bring-your-own serialization; see the sketch below).
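If you do need nested values, one bring-your-own-serialization sketch (names here are illustrative, plain json only):

import json

nested = {"params": {"threshold": 0.5, "window": 7}}
meta = {key: json.dumps(value) for key, value in nested.items()}  # dict[str, str]
# ... write with update_file_custom_metadata as above, then decode after reading:
decoded = {key: json.loads(value) for key, value in meta.items()}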

Comment From: davetapley

This is done and in 2.1.0 🎉 - https://github.com/pandas-dev/pandas/pull/54346
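A minimal sketch of the attrs round-trip discussed above, assuming pandas >= 2.1.0 with the pyarrow engine:

import pandas as pd

df = pd.DataFrame({"a": [1]})
df.attrs = {"name": "my custom dataset"}
df.to_parquet("file.parquet")

pd.read_parquet("file.parquet").attrs
# {"name": "my custom dataset"}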