I'm confused about how to preset min_itemsizes for appending to an HDFStore. Say the DataFrames a and b in the MWE below are user-provided, so they can contain any character and the encoding is unknown. Appending a works, but appending b fails even though:
In [4]: len('香')
Out[4]: 1
So far I simply used str.len().max() on the string columns to get the numbers for min_itemsize, but this does not work in the example here. This MWE is of course simplified, but I guess I'm wondering:
- how does pytables come up with the string length?
- how should I determine the string length, considering that the encoding is unknown? Does pytables assume some encoding, or does it convert the strings to some other object?
In this toy example I could encode the string as utf-8 to get the correct length, but this isn't a general approach:
In [5]: len('香'.encode('utf-8'))
Out[5]: 3
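To illustrate why the character count and the stored size can differ, here is a minimal sketch in plain Python (no pandas or pytables involved). It only shows that the byte length of the same one-character string depends on which encoding is chosen; the three encodings listed are picked for illustration.

```python
# One character, but the number of bytes depends on the encoding used.
s = "香"

char_len = len(s)  # number of Unicode code points

# Byte length under a few illustrative encodings.
byte_lens = {
    enc: len(s.encode(enc))
    for enc in ("utf-8", "utf-16-le", "utf-32-le")
}

print(char_len, byte_lens)
```

So a size derived from len() on the str only matches the encoded size when every character happens to encode to a single byte.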
MWE:
import pandas as pd
a = pd.DataFrame([['a', 'b']], columns = ['A', 'B'])
b = pd.DataFrame([['香', 'b']], columns = ['A', 'B'])
store = pd.HDFStore('/tmp/tmpstore')
store.append('df', a, min_itemsizes={'A': 1, 'B': 1})
store.append('df', b, min_itemsizes={'A': 1, 'B': 1}) # fails
Expected Output
ValueError: Trying to store a string with len [3] in [values_block_0] column but this column has a limit of [1]! Consider using min_itemsize to preset the sizes on these columns Closing remaining open files:/tmp/tmpstore...done
Comment From: jreback
min_itemsize is the kw (not min_itemsizes).
but the size is actually the length of the stored bytes (and not characters), so you need to encode anyhow. your best bet is to simply make a larger than needed size.
In [52]: store = pd.HDFStore('tmp.h5', mode='w')
In [53]: store.append('df', a, min_itemsize={'A': 10, 'B': 10}, encoding='utf-8')
In [54]: store.append('df', b)
Comment From: jreback
docs are here
I suppose a note on using encodings correctly might be nice if you want to do a PR.
Comment From: jreback
going to reopen this as a doc issue: the docs should explain how the len counting works when data is encoded
Comment From: johanneshk
So the assumption here is: The user knows if her string contains encoded characters? In that case calling len(string) wouldn't suffice; instead one would need to do len(string.encode('encoding')) as demonstrated above (or the equivalent on a DataFrame column). In addition, the encoding kw needs to be specified when appending.
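The "equivalent on a DataFrame column" could look roughly like the sketch below. The helper name encoded_max_len is hypothetical (it is not part of pandas), and the sketch assumes the columns hold Python str values and that the same encoding is later passed to store.append.

```python
import pandas as pd


def encoded_max_len(df, columns, encoding="utf-8"):
    """Hypothetical helper: longest encoded byte length per column.

    Assumes the values are str (non-str values are passed through str()).
    """
    return {
        col: int(df[col].map(lambda v: len(str(v).encode(encoding))).max())
        for col in columns
    }


b = pd.DataFrame([["香", "b"]], columns=["A", "B"])
sizes = encoded_max_len(b, ["A", "B"])
print(sizes)
```

Because the limit applies to everything appended later as well, the sizes would need to cover the largest value expected across all frames, not only the first one, which is why padding the result generously is the simpler choice.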
I'd like to submit a PR to clarify this. Will try to do this in the next few days.
Comment From: burcakotlu
I'm trying to store dataframes in a store. In the first dataframe, the 'chrNumber' column contains the value 1. In the later dataframes, the 'chrNumber' column contains 10, 11, 12, and so on. Although I have set min_itemsize={'chrNumber':2} in the store.append() command, I still get the ValueError:
ValueError: Trying to store a string with len [2] in [chrNumber] column but this column has a limit of [1]! Consider using min_itemsize to preset the sizes on these columns Closing remaining open files:df_all.h5...done
And this is the python code for appending dataframes into a store with min_itemsize.
store.append('df', augmented_chrBased_snp_df, data_columns=['Cancer Type', 'Sample', 'paperDOI', 'genomeAssembly', 'type', 'chrNumber', 'start', 'end', 'originalNucleotide', 'mutatedNucleotide', 'dataSource','strand', 'PyramidineStrand', 'Mutation Type', 'Mutation Subtype'], min_itemsize={'Cancer Type':15, 'Sample':8, 'chrNumber':2})
Comment From: keoughkath
@burcakotlu, it might be due to the HDFStore already existing, in which case the min size for that column is already set. I fixed this error in my case by deleting the existing HDFStore and recreating it, ensuring that the min_itemsize parameter was set from the beginning. Hope this helps!
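As a rough sketch of that advice applied to the MWE from the top of the thread: the store is opened with mode='w' so that no previously created table (with its old column limits) survives, and min_itemsize is passed on the very first append. The temporary path and the size of 10 are arbitrary choices for illustration, and PyTables needs to be installed for this to run.

```python
import os
import tempfile

import pandas as pd

a = pd.DataFrame([["a", "b"]], columns=["A", "B"])
b = pd.DataFrame([["香", "b"]], columns=["A", "B"])

# Arbitrary scratch location for the example.
path = os.path.join(tempfile.mkdtemp(), "store.h5")

# mode="w" starts from an empty file, so no earlier column limits remain.
with pd.HDFStore(path, mode="w") as store:
    # The limits are fixed when the table is created, i.e. on the first append.
    store.append(
        "df",
        a,
        data_columns=["A", "B"],
        min_itemsize={"A": 10, "B": 10},  # sizes are in encoded bytes
        encoding="utf-8",
    )
    # Later appends only need to fit within the limits set above.
    store.append("df", b, encoding="utf-8")
    result = store.select("df")

print(result)
```

Passing min_itemsize only on a later append, after the table already exists with a smaller limit, would not enlarge the column.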
Comment From: riteshpen
Does this issue still need to be resolved? I am looking for a first-time open-source contribution.
Comment From: aleksejs-fomins
@riteshpen Yes please! The documentation for HDFStore is really bad IMHO
Consider the following * https://pandas.pydata.org/docs/reference/api/pandas.HDFStore.put.html * https://pandas.pydata.org/docs/reference/api/pandas.HDFStore.append.html
It is not explained what min_itemsize actually does. It is stated that it is a dictionary, but clearly in other places min_itemsize is used as an integer. What are the actual allowed types of min_itemsize, and what is the expected behaviour?
Comment From: JoeDediop
take