Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

# A simple notebook with the dataset that triggers the issue:
https://colab.research.google.com/drive/1CM4kmgcQ3mEGlX0lXYiONYk25B8YlWo8?usp=sharing

Issue Description

When I call pd.read_html(), I get a table with multiple duplicated columns, and the resulting table does not make sense.

The following table:

[image: Pandas BUG: QST: pd.read_html gives tables with duplicated columns]

is parsed as (JPM_0000019617-23-000432_TABLE154.html):

|    | 0                             | 1                             | 2                             | 3                           | 4                           | 5                           | 6                           | 7                           | 8                           |   9 |   10 |   11 | 12                        | 13                        | 14                        | 15                        | 16                        | 17                        | 18                        | 19                        | 20                        |
|---:|:------------------------------|:------------------------------|:------------------------------|:----------------------------|:----------------------------|:----------------------------|:----------------------------|:----------------------------|:----------------------------|----:|-----:|-----:|:--------------------------|:--------------------------|:--------------------------|:--------------------------|:--------------------------|:--------------------------|:--------------------------|:--------------------------|:--------------------------|
|  0 | nan                           | nan                           | nan                           | nan                         | nan                         | nan                         | nan                         | nan                         | nan                         | nan |  nan |  nan | nan                       | nan                       | nan                       | nan                       | nan                       | nan                       | nan                       | nan                       | nan                       |
|  1 | nan                           | nan                           | nan                           | Three months ended June 30, | Three months ended June 30, | Three months ended June 30, | Three months ended June 30, | Three months ended June 30, | Three months ended June 30, | nan |  nan |  nan | Six months ended June 30, | Six months ended June 30, | Six months ended June 30, | Six months ended June 30, | Six months ended June 30, | Six months ended June 30, | Six months ended June 30, | Six months ended June 30, | Six months ended June 30, |
|  2 | (in millions)                 | (in millions)                 | (in millions)                 | 2023                        | 2023                        | 2023                        | 2022                        | 2022                        | 2022                        | nan |  nan |  nan | 2023                      | 2023                      | 2023                      | nan                       | nan                       | nan                       | 2022                      | 2022                      | 2022                      |
|  3 | Underwriting                  | Underwriting                  | Underwriting                  | nan                         | nan                         | nan                         | nan                         | nan                         | nan                         | nan |  nan |  nan | nan                       | nan                       | nan                       | nan                       | nan                       | nan                       | nan                       | nan                       | nan                       |
|  4 | Equity                        | Equity                        | Equity                        | $                           | 317                         | nan                         | $                           | 230                         | nan                         | nan |  nan |  nan | $                         | 550                       | nan                       | nan                       | nan                       | nan                       | $                         | 472                       | nan                       |
|  5 | Debt                          | Debt                          | Debt                          | 704                         | 704                         | nan                         | 711                         | 711                         | nan                         | nan |  nan |  nan | 1376                      | 1376                      | nan                       | nan                       | nan                       | nan                       | 1685                      | 1685                      | nan                       |
|  6 | Total underwriting            | Total underwriting            | Total underwriting            | 1021                        | 1021                        | nan                         | 941                         | 941                         | nan                         | nan |  nan |  nan | 1926                      | 1926                      | nan                       | nan                       | nan                       | nan                       | 2157                      | 2157                      | nan                       |
|  7 | Advisory                      | Advisory                      | Advisory                      | 492                         | 492                         | nan                         | 645                         | 645                         | nan                         | nan |  nan |  nan | 1236                      | 1236                      | nan                       | nan                       | nan                       | nan                       | 1437                      | 1437                      | nan                       |
|  8 | Total investment banking fees | Total investment banking fees | Total investment banking fees | $                           | 1513                        | nan                         | $                           | 1586                        | nan                         | nan |  nan |  nan | $                         | 3162                      | nan                       | nan                       | nan                       | nan                       | $                         | 3594                      | nan                       |
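
The pattern in the output above (a label repeated three times, and `704 | 704 | nan` next to `$ | 317 | nan`) is what `pd.read_html` produces when cells carry a `colspan` attribute: the cell text is repeated once per spanned column. The original HTML is not included in this report, so the markup below is a hypothetical stand-in rather than the actual JPM table. A minimal sketch, assuming an HTML parser backend such as lxml is installed:

```python
from io import StringIO

import pandas as pd

# Hypothetical markup (not the original JPM table): the label cell spans
# three columns, and the "Debt" amount spans the "$" and number columns.
html = """
<table>
  <tr><td colspan="3">Equity</td><td>$</td><td>317</td><td></td></tr>
  <tr><td colspan="3">Debt</td><td colspan="2">704</td><td></td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found.
tables = pd.read_html(StringIO(html))
df = tables[0]

# Each spanned cell is repeated across the columns it covers, so the
# label appears in columns 0-2 and 704 appears in columns 3 and 4.
print(df)
```

If the real table is built this way, the repeated columns come from the source markup's layout cells rather than from the data itself.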

Expected Behavior

  • The columns would be a MultiIndex built from the table headers.
  • Only one of columns 0, 1, and 2 would exist (and likewise for every other group of duplicated columns).
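
For the first expectation, `pd.read_html` accepts a `header` argument listing the rows to use as column labels, and passing two row positions yields MultiIndex columns. A minimal sketch on hypothetical markup (the row positions and cell layout are assumptions, not taken from the original file), again assuming a parser such as lxml is installed. It does not collapse columns duplicated by `colspan`; the spanned header simply becomes a repeated top-level label:

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row header: a period label spanning two year columns.
html = """
<table>
  <tr><td></td><td colspan="2">Three months ended June 30,</td></tr>
  <tr><td>(in millions)</td><td>2023</td><td>2022</td></tr>
  <tr><td>Equity</td><td>317</td><td>230</td></tr>
  <tr><td>Debt</td><td>704</td><td>711</td></tr>
</table>
"""

# header=[0, 1] asks pandas to build two-level column labels from rows 0 and 1.
df = pd.read_html(StringIO(html), header=[0, 1])[0]

print(df.columns.tolist())
print(df)
```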

Just a thought: is there an out-of-the-box solution to standardize these kinds of tables (something like pd.melt)?
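
`pd.melt` reshapes a frame from wide to long but does not remove repeated columns, so it would not fix this on its own. One possible post-processing sketch, assuming the duplicates come from expanded `colspan` cells and that no two genuinely different columns hold identical values (otherwise the second would be dropped too). The frame is typed in by hand from a few rows of the output above, and `raw`, `cleaned`, and `result` are illustrative names:

```python
import numpy as np
import pandas as pd

# Columns 0-5 of three rows from the parsed output, typed in by hand.
raw = pd.DataFrame(
    [
        ["Equity", "Equity", "Equity", "$", "317", np.nan],
        ["Debt", "Debt", "Debt", "704", "704", np.nan],
        ["Advisory", "Advisory", "Advisory", "492", "492", np.nan],
    ]
)

# 1. Drop columns that contain no data at all.
no_empty = raw.dropna(axis=1, how="all")

# 2. Keep only the first of any group of identical columns.
is_repeat = no_empty.T.duplicated().to_numpy()
cleaned = no_empty.loc[:, ~is_repeat]

# 3. Convert the remaining value columns to numbers ("$" becomes NaN) and
#    take the first number found in each row.
numeric = cleaned.iloc[:, 1:].apply(pd.to_numeric, errors="coerce")
result = pd.DataFrame(
    {
        "label": cleaned.iloc[:, 0],
        "value": numeric.bfill(axis=1).iloc[:, 0],
    }
)

print(result)
```

Step 3 relies on the amount being the first numeric entry in each row, which holds for the rows shown here but may not hold for every table.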

Installed Versions

```
INSTALLED VERSIONS
------------------
commit           : 2e218d10984e9919f0296931d92ea851c6a6faf5
python           : 3.10.12.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.15.120+
Version          : #1 SMP Wed Aug 30 11:19:59 UTC 2023
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : en_US.UTF-8
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.5.3
numpy            : 1.23.5
pytz             : 2023.3.post1
dateutil         : 2.8.2
setuptools       : 67.7.2
pip              : 23.1.2
Cython           : 3.0.6
pytest           : 7.4.3
hypothesis       : None
sphinx           : 5.0.2
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.9.3
html5lib         : 1.1
pymysql          : None
psycopg2         : 2.9.9
jinja2           : 3.1.2
IPython          : 7.34.0
pandas_datareader: 0.10.0
bs4              : 4.11.2
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : 2023.6.0
gcsfs            : 2023.6.0
matplotlib       : 3.7.1
numba            : 0.58.1
numexpr          : 2.8.7
odfpy            : None
openpyxl         : 3.1.2
pandas_gbq       : 0.17.9
pyarrow          : 9.0.0
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : 1.11.4
snappy           : None
sqlalchemy       : 2.0.23
tables           : 3.8.0
tabulate         : 0.9.0
xarray           : 2023.7.0
xlrd             : 2.0.1
xlwt             : None
zstandard        : None
tzdata           : None
```

Comment From: rhshadrach

Thanks for the report! Can you include the code to reproduce in the OP here directly (rather than a link), and try to make it as minimal as possible?

Have you seen the documentation on the table-parsing gotchas?

Comment From: mroeschke

Closing as this appeared to need more information to be actionable