Code Sample, a copy-pastable example if possible
d=pandas.DataFrame({('A','A'): [0,1,2,3,4,5]})
d[('A2','A2')] = 1
d.columns = [['A','A'],['A','A']]
print(d)
d.to_csv('TEST.csv')
print(pandas.read_csv('TEST.csv',header=[0,1], index_col=0))
d.to_excel('TEST.xlsx')
print(pandas.read_excel('TEST.xlsx',header=[0,1], index_col=0))
Problem description
The above snippet generates the following output:
A
A A
0 0 1
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
(A, A) ('A', 'A').1
0 0 1
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-110-4745297eb588> in <module>()
6 print(pandas.read_csv('TEST.csv',header=[0,1], index_col=0))
7 d.to_excel('TEST.xlsx')
----> 8 print(pandas.read_excel('TEST.xlsx',header=[0,1], index_col=0))
c:\programdata\anaconda3\lib\site-packages\pandas\io\excel.py in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, names, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, dtype, true_values, false_values, engine, squeeze, **kwds)
207 skip_footer=skip_footer, converters=converters, dtype=dtype,
208 true_values=true_values, false_values=false_values, squeeze=squeeze,
--> 209 **kwds)
210
211
c:\programdata\anaconda3\lib\site-packages\pandas\io\excel.py in _parse_excel(self, sheetname, header, skiprows, names, skip_footer, index_col, has_index_names, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, true_values, false_values, verbose, dtype, squeeze, **kwds)
515 if not squeeze or isinstance(output[asheetname], DataFrame):
516 output[asheetname].columns = output[
--> 517 asheetname].columns.set_names(header_names)
518 except EmptyDataError:
519 # No Data, return an empty DataFrame
c:\programdata\anaconda3\lib\site-packages\pandas\core\indexes\base.py in set_names(self, names, level, inplace)
1119 else:
1120 idx = self._shallow_copy()
-> 1121 idx._set_names(names, level=level)
1122 if not inplace:
1123 return idx
c:\programdata\anaconda3\lib\site-packages\pandas\core\indexes\base.py in _set_names(self, values, level)
1059 if len(values) != 1:
1060 raise ValueError('Length of new names must be 1, got %d' %
-> 1061 len(values))
1062 self.name = values[0]
1063
ValueError: Length of new names must be 1, got 2
Note that read_csv will read the csv (but mangles the column index). read_excel fails without clearly indicating the nature of the error.
Expected Output
read_excel should approximately match what read_csv does. If not, it would be a lot easier to diagnose the error if the error message indicated that the problem was a duplicate column and, ideally, which columns is the cause.
Output of pd.show_versions()
Comment From: chris-b1
Sure, I think this could be made to work similar to CSV (though in general life is easier without duplicate columns). PR to fix would be welcome!
Comment From: spillz
If I was in a position to submit PRs, I would. But as I am not, I thought a bug report would be better then nothing.
I am fine with the duplicate cols being treated as an error but keep in mind that means you can no longer open arbitrarily named datasets. Also, the read_csv behavior really isn't ideal either. The multiindex becomes a regular index and there are no warnings when duplicates are found and cols renamed.
Anyway the main reason I reported this as a bug is that it took me half an hour to figure out that the error was being caused by duplicate columns. The message is very obscure!
On Jan 25, 2018 1:21 PM, "chris-b1" notifications@github.com wrote:
Sure, I think this could be made to work similar to CSV (though in general life is easier without duplicate columns). PR to fix would be welcome!
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/19395#issuecomment-360554365, or mute the thread https://github.com/notifications/unsubscribe-auth/AAFd5c5WfoQyY2_9Dcagms0pWjDIfT-cks5tOMY2gaJpZM4RtPKI .
Comment From: chris-b1
Appreciate the report either way, would also take a PR for an improved error msg which would be easier
Comment From: jrhaneydata
I just upgrade from Pandas 0.22 to 0.24, and my code is breaking. What's happening in my code appears related to this issue.
I have a worksheet in Excel with three tables in it. Those three tables have the same columns.
In version 0.22.0 of pandas, I could read all three of those tables into python with three read_excel statements, using the usecols argument to specify which part of the spreadsheet I wanted to read in.
In version 0.24.1 of pandas, the presence of duplicate column names in a part of the spreadsheet that I'm not reading into Python forces the column names in the imported dataframe to have ".1" suffixes appended to them.
I can adjust my code to rename the columns after import, of course, but that's ugly. Was this an intentional change to the read_excel function that hasn't been documented? Or is this a bug?