Pandas "invalid dtype determination in get_concat_dtype" when concatenating dfs with certain columns

Code Sample, a copy-pastable example if possible

There might be a simpler minimal example, but I was already really struggling to identify this problem and to find this example. The problem seems to be related to strings reappearing in different positions of the tuples, tuples of different lengths, and unequal sets of columns.

(By the way, I'm aware of MultiIndex; I would like to convert the Index to a MultiIndex after the concatenation.)

import pandas as pd

items_a = [("b","e","c","a","b"),
         ("e","e","c","a","c"),
         ("e","a","c","a","d"),
         ("b","a","b","e"),
         ("e","b","a"),
         ("e","c","c","a")]
items_b = [("b","e","c","a","b"),
         ("a","a","d","b","d"),
         ("a","b","d","b","e"),
         ("c","b","c","a"),
         ("a","c","b"),
         ("a","d","d","b")]
df1=pd.DataFrame([range(6)], columns=items_a)
df2=pd.DataFrame([range(6)], columns=items_b)
pd.concat([df1, df2])

Problem description

This yields

AssertionError: invalid dtype determination in get_concat_dtype

Expected Output

Something similar to

df1.columns = [str(c) for c in df1.columns]
df2.columns = [str(c) for c in df2.columns]
pd.concat([df1, df2])
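A minimal sketch of the string round trip, assuming the goal is to end up with tuple-based columns again after the concatenation. The padding step with "" and the names `restored`, `padded`, and `depth` are illustrative assumptions, not something pandas requires or does on its own; `list(range(6))` is used so the rows are plain lists on Python 3 as well.

```python
import ast

import pandas as pd

# Tuple column labels from the report above (ragged: lengths 3 to 5).
items_a = [("b", "e", "c", "a", "b"), ("e", "e", "c", "a", "c"),
           ("e", "a", "c", "a", "d"), ("b", "a", "b", "e"),
           ("e", "b", "a"), ("e", "c", "c", "a")]
items_b = [("b", "e", "c", "a", "b"), ("a", "a", "d", "b", "d"),
           ("a", "b", "d", "b", "e"), ("c", "b", "c", "a"),
           ("a", "c", "b"), ("a", "d", "d", "b")]

# Use string labels from the start, so pandas never sees tuple-valued
# column labels during the concatenation.
df1 = pd.DataFrame([list(range(6))], columns=[str(c) for c in items_a])
df2 = pd.DataFrame([list(range(6))], columns=[str(c) for c in items_b])
result = pd.concat([df1, df2])

# Recover the original tuples from their string representation.
restored = [ast.literal_eval(label) for label in result.columns]

# Hypothetical padding step: fill short paths with "" so every tuple has
# the same length before building the MultiIndex explicitly.
depth = max(len(t) for t in restored)
padded = [t + ("",) * (depth - len(t)) for t in restored]
result.columns = pd.MultiIndex.from_tuples(padded)
```

Padding explicitly avoids relying on how pandas treats tuples of unequal length, which is exactly the area this report is about.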

Output of `pd.show_versions()`

(same result with pandas=0.17.1)

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-116-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.utf8
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.23.0.dev0+38.g6552718
pytest: 2.8.7
pip: 9.0.1
setuptools: 20.7.0
Cython: 0.23.4
numpy: 1.14.2
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 5.5.0
sphinx: 1.3.6
patsy: 0.4.1
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.3.0
xlrd: 0.9.4
xlwt: 0.7.5
xlsxwriter: 0.7.3
lxml: 3.5.0
bs4: 4.4.1
html5lib: 0.9999999
sqlalchemy: 1.0.11
pymysql: None
psycopg2: 2.6.1 (dt dec mx pq3 ext lo64)
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Comment From: jreback

You are fighting pandas here - I suppose this could be supported, but it's not efficient in the least, nor very useful in terms of indexing.

You would very likely need a custom index type to have any real support here - quite a major effort. If you wanted to contribute this, great.

Comment From: jreback

cc @toobaz

Comment From: Mofef

Oh, so you are aware of the problem? Could you please explain a bit more about why it fails? I can assure you that I'm not fighting pandas on purpose. ;) But I don't really understand what is going on, so I can't find a workaround other than converting the column names to strings and back. I can't even consistently reproduce the error yet. Maybe a more informative error message would already be enough to resolve this issue? I would certainly help as soon as I understand the problem.

In case you were wondering, what I actually do is convert tree structures to a pandas DataFrame, with one row representing one tree. (The trees are very similar but not always identical in structure.) The tuples (columns) give the path through the tree, and the data is given by the leaves. The problem apparently occurs when a child contains an object similar to its parent. In some cases it also fails with the same error if I use pd.MultiIndex.from_tuples, though not in the example described in the OP.
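To make the tree-to-columns idea concrete, here is a rough sketch of how one tree could be flattened into path tuples. It assumes the trees are nested dicts with non-dict leaves; the function name `flatten_tree` and the sample tree are hypothetical and not taken from the actual code behind this report.

```python
def flatten_tree(tree, prefix=()):
    """Map each leaf of a nested dict to the tuple of keys leading to it."""
    flat = {}
    for key, child in tree.items():
        path = prefix + (key,)
        if isinstance(child, dict):
            # Descend into the subtree, extending the current path.
            flat.update(flatten_tree(child, path))
        else:
            # A leaf: the path so far becomes the column label.
            flat[path] = child
    return flat


# Hypothetical tree in which keys repeat at different depths.
tree = {
    "e": {"b": {"a": 1}, "c": {"c": {"a": 2}}},
    "b": {"a": {"b": {"e": 3}}},
}
row = flatten_tree(tree)
```

Branches of different depth produce tuples of different lengths, and keys that recur at several levels produce the repeated strings seen in the example columns.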

Comment From: jreback

why are you not using a MultiIndex?

Comment From: Mofef

Originally I wanted to convert it to a MultiIndex after concatenating, but sure, that would be an acceptable workaround. Though, in my case it also failed with the same error when concatenating (not for the example above).

Comment From: jreback

then show an example using MI that fails

Comment From: Mofef

Weird... my test case must have been flawed... I can't reproduce it anymore. So thanks a lot for the help.

Still, if you had the patience to explain, I would be really interested in what is going wrong in the example above.

Comment From: jreback

this actually breaks in a different place in master. cc @TomAugspurger

In [3]: items_a = [("b","e","c","a","b"),
   ...:          ("e","e","c","a","c"),
   ...:          ("e","a","c","a","d"),
   ...:          ("b","a","b","e"),
   ...:          ("e","b","a"),
   ...:          ("e","c","c","a")]
   ...: items_b = [("b","e","c","a","b"),
   ...:          ("a","a","d","b","d"),
   ...:          ("a","b","d","b","e"),
   ...:          ("c","b","c","a"),
   ...:          ("a","c","b"),
   ...:          ("a","d","d","b")]
   ...: df1=pd.DataFrame([range(6)], columns=items_a)
   ...: df2=pd.DataFrame([range(6)], columns=items_b)
   ...: pd.concat([df1, df2])
   ...:          
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-355de89f5317> in <module>()
     13 df1=pd.DataFrame([range(6)], columns=items_a)
     14 df2=pd.DataFrame([range(6)], columns=items_b)
---> 15 pd.concat([df1, df2])
     16 

~/pandas/pandas/core/reshape/concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
    211                        verify_integrity=verify_integrity,
    212                        copy=copy)
--> 213     return op.get_result()
    214 
    215 

~/pandas/pandas/core/reshape/concat.py in get_result(self)
    406             new_data = concatenate_block_managers(
    407                 mgrs_indexers, self.new_axes, concat_axis=self.axis,
--> 408                 copy=self.copy)
    409             if not self.copy:
    410                 new_data._consolidate_inplace()

~/pandas/pandas/core/internals.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
   5372                 values = values.view()
   5373             b = b.make_block_same_class(values, placement=placement)
-> 5374         elif is_uniform_join_units(join_units):
   5375             b = join_units[0].block.concat_same_type(
   5376                 [ju.block for ju in join_units], placement=placement)

~/pandas/pandas/core/internals.py in is_uniform_join_units(join_units)
   5396         # no blocks that would get missing values (can lead to type upcasts)
   5397         # unless we're an extension dtype.
-> 5398         all(not ju.is_na or ju.block.is_extension for ju in join_units) and
   5399         # no blocks with indexers (as then the dimensions do not fit)
   5400         all(not ju.indexers for ju in join_units) and
~/pandas/pandas/core/internals.py in <genexpr>(.0)
   5396         # no blocks that would get missing values (can lead to type upcasts)
   5397         # unless we're an extension dtype.
-> 5398         all(not ju.is_na or ju.block.is_extension for ju in join_units) and
   5399         # no blocks with indexers (as then the dimensions do not fit)
   5400         all(not ju.indexers for ju in join_units) and

AttributeError: 'NoneType' object has no attribute 'is_extension'
> /Users/jreback/pandas/pandas/core/internals.py(5398)<genexpr>()
   5396         # no blocks that would get missing values (can lead to type upcasts)
   5397         # unless we're an extension dtype.
-> 5398         all(not ju.is_na or ju.block.is_extension for ju in join_units) and
   5399         # no blocks with indexers (as then the dimensions do not fit)
   5400         all(not ju.indexers for ju in join_units) and

I didn't think a JoinUnit could be None
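The expression on line 5398 only touches `ju.block` when `ju.is_na` is truthy, so the traceback suggests a join unit that is flagged as NA and whose `block` is None. The sketch below reproduces just that evaluation order with hypothetical stand-in classes (`FakeJoinUnit`, `FakeBlock`); it does not use or model the real pandas internals, and it says nothing about why pandas ends up in that state.

```python
class FakeBlock:
    """Hypothetical stand-in for a block; only the attribute used below."""
    is_extension = False


class FakeJoinUnit:
    """Hypothetical stand-in exposing only `is_na` and `block`."""

    def __init__(self, is_na, block):
        self.is_na = is_na
        self.block = block


def no_missing_or_extension(join_units):
    # Same boolean expression as line 5398 in the traceback.
    return all(not ju.is_na or ju.block.is_extension for ju in join_units)


# `not ju.is_na` is True, so `or` short-circuits and `block` is never read.
ok = no_missing_or_extension([FakeJoinUnit(is_na=False, block=None)])

# With a regular block, an NA unit simply makes the check evaluate to False.
with_block = no_missing_or_extension([FakeJoinUnit(is_na=True, block=FakeBlock())])

# With is_na=True and block=None, the right-hand side is evaluated and fails.
try:
    no_missing_or_extension([FakeJoinUnit(is_na=True, block=None)])
    raised = False
    message = ""
except AttributeError as exc:
    raised = True
    message = str(exc)
```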

Comment From: Mofef

#20757 might be what caused my observation that this issue also occurred when using MultiIndex (referring to @jreback's comment here: https://github.com/pandas-dev/pandas/issues/20597#issuecomment-378609474).
