Pandas DOC: min_itemsize for HDFStore append for encoded strings

I'm confused about how to preset min_itemsizes for appending to an HDFStore. Say DataFrames a and b in the MWE below are user-provided, so they can contain any character and the encoding is unknown. Appending a works, but appending b fails even though:

In [4]: len('香')
Out[4]: 1

So far I simply used str.len().max() on the string columns to get the numbers for min_itemsize, but this does not work in the example here. This MWE is of course simplified, but I guess I'm wondering:

How does pytables come up with the string length?
How should I determine the string length, considering the encoding is unknown? Does pytables assume some encoding, or does it convert the strings to some other object?

In this toy example I could encode the string as utf-8 to get the correct length, but this isn't a general approach:

In [5]: len('香'.encode('utf-8'))
Out[5]: 3
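
To illustrate why the byte length is not a property of the string alone, here is a minimal sketch in plain Python (not part of the original report). The three encodings are an arbitrary choice for illustration:

```python
# Character count versus encoded byte count for the same string.
text = "香"

# len() on a str counts Unicode code points, not bytes.
char_count = len(text)

# The byte length depends on the encoding used to serialize the string.
byte_counts = {
    enc: len(text.encode(enc))
    for enc in ("utf-8", "utf-16-le", "utf-32-le")
}

print(char_count)
print(byte_counts)
```

The same one-character string occupies a different number of bytes under each encoding, which is why a character count alone cannot serve as an item size.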

MWE:

import pandas as pd

a = pd.DataFrame([['a', 'b']], columns = ['A', 'B'])
b = pd.DataFrame([['香', 'b']], columns = ['A', 'B'])

store = pd.HDFStore('/tmp/tmpstore')

store.append('df', a, min_itemsizes={'A': 1, 'B': 1})
store.append('df', b, min_itemsizes={'A': 1, 'B': 1}) # fails

Expected Output

ValueError: Trying to store a string with len [3] in [values_block_0] column but this column has a limit of [1]! Consider using min_itemsize to preset the sizes on these columns
Closing remaining open files:/tmp/tmpstore...done

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.28-2-MANJARO
machine: x86_64
processor:
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_DE.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.19.0
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.23.5
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.8
patsy: None
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.3
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: None
pandas_datareader: None

Comment From: jreback

min_itemsize is the kw (not min_itemsizes).

But the size is actually the length of the stored bytes (not characters), so you need to encode anyhow. Your best bet is to simply preset a size that is larger than needed.

In [52]: store = pd.HDFStore('tmp.h5', mode='w')

In [53]: store.append('df', a, min_itemsize={'A': 10, 'B': 10}, encoding='utf-8')

In [54]: store.append('df', b)

Comment From: jreback

docs are here

I suppose a note on using encodings correctly might be nice if you want to do a PR.

Comment From: jreback

Going to reopen this as a doc issue: the docs should explain how the length counting works when data is encoded.

Comment From: johanneshk

So the assumption here is: The user knows if her string contains encoded characters? In that case calling len(string) wouldn't suffice, instead one would need to do len(string.encode('encoding')) as demonstrated above (or the equivalent on a DataFrame column). In addition the encoding kw needs to be specified when appending. I'd like to submit a PR to clarify this. Will try to do this in the next few days.
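
A minimal sketch of the "equivalent on a DataFrame column" mentioned above. The helper name encoded_itemsizes is hypothetical (it is not a pandas function), and the sketch assumes the columns hold only str values with no missing entries and that UTF-8 is the encoding passed to append:

```python
import pandas as pd


def encoded_itemsizes(df, columns, encoding="utf-8"):
    """Hypothetical helper: max encoded byte length per string column."""
    sizes = {}
    for col in columns:
        # Encode each value and measure it in bytes, not characters.
        byte_lengths = df[col].map(lambda s: len(s.encode(encoding)))
        sizes[col] = int(byte_lengths.max())
    return sizes


# Same frame as `b` in the MWE.
b = pd.DataFrame([["香", "b"]], columns=["A", "B"])
sizes = encoded_itemsizes(b, ["A", "B"])
print(sizes)
```

The resulting dict could then be passed as min_itemsize on the first append, together with the matching encoding keyword. It only covers the data seen so far, so some headroom is still advisable if later frames may contain longer strings.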

Comment From: burcakotlu

I'm trying to store dataframes in a store. In the first dataframe, the 'chrNumber' column contains 1. In the subsequent dataframes, the 'chrNumber' column contains 10, 11, 12, and so on. Although I have set min_itemsize={'chrNumber':2} in the store.append() call, I still get the ValueError:

ValueError: Trying to store a string with len [2] in [chrNumber] column but this column has a limit of [1]! Consider using min_itemsize to preset the sizes on these columns
Closing remaining open files:df_all.h5...done

And this is the python code for appending dataframes into a store with min_itemsize.

store.append('df', augmented_chrBased_snp_df, data_columns=['Cancer Type', 'Sample', 'paperDOI', 'genomeAssembly', 'type', 'chrNumber', 'start', 'end', 'originalNucleotide', 'mutatedNucleotide', 'dataSource','strand', 'PyramidineStrand', 'Mutation Type', 'Mutation Subtype'], min_itemsize={'Cancer Type':15, 'Sample':8, 'chrNumber':2})

Comment From: keoughkath

@burcakotlu, it might be because the HDFStore already exists, so the minimum size for that column is already set. I fixed this error in my case by deleting the existing HDFStore and recreating it, ensuring the min_itemsize parameter was set from the beginning. Hope this helps!

Comment From: riteshpen

Does this issue still need to be resolved? I am looking for a first-time open-source contribution.

Comment From: aleksejs-fomins

@riteshpen Yes please! The documentation for HDFStore is really bad IMHO

Consider the following:
* https://pandas.pydata.org/docs/reference/api/pandas.HDFStore.put.html
* https://pandas.pydata.org/docs/reference/api/pandas.HDFStore.append.html

It is not explained what min_itemsize actually does. It is stated that it is a dictionary, but clearly in other places min_itemsize is used as an integer. What are the actual allowed types of min_itemsize, and what is the expected behaviour?

Comment From: JoeDediop

take
