Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
# A simple notebook that reproduces the issue with the problematic dataset:
https://colab.research.google.com/drive/1CM4kmgcQ3mEGlX0lXYiONYk25B8YlWo8?usp=sharing
Issue Description
When I call `pd.read_html()`, the resulting table has many duplicated columns and does not make sense.
The table in JPM_0000019617-23-000432_TABLE154.html is parsed as:
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---:|:------------------------------|:------------------------------|:------------------------------|:----------------------------|:----------------------------|:----------------------------|:----------------------------|:----------------------------|:----------------------------|----:|-----:|-----:|:--------------------------|:--------------------------|:--------------------------|:--------------------------|:--------------------------|:--------------------------|:--------------------------|:--------------------------|:--------------------------|
| 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 1 | nan | nan | nan | Three months ended June 30, | Three months ended June 30, | Three months ended June 30, | Three months ended June 30, | Three months ended June 30, | Three months ended June 30, | nan | nan | nan | Six months ended June 30, | Six months ended June 30, | Six months ended June 30, | Six months ended June 30, | Six months ended June 30, | Six months ended June 30, | Six months ended June 30, | Six months ended June 30, | Six months ended June 30, |
| 2 | (in millions) | (in millions) | (in millions) | 2023 | 2023 | 2023 | 2022 | 2022 | 2022 | nan | nan | nan | 2023 | 2023 | 2023 | nan | nan | nan | 2022 | 2022 | 2022 |
| 3 | Underwriting | Underwriting | Underwriting | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 4 | Equity | Equity | Equity | $ | 317 | nan | $ | 230 | nan | nan | nan | nan | $ | 550 | nan | nan | nan | nan | $ | 472 | nan |
| 5 | Debt | Debt | Debt | 704 | 704 | nan | 711 | 711 | nan | nan | nan | nan | 1376 | 1376 | nan | nan | nan | nan | 1685 | 1685 | nan |
| 6 | Total underwriting | Total underwriting | Total underwriting | 1021 | 1021 | nan | 941 | 941 | nan | nan | nan | nan | 1926 | 1926 | nan | nan | nan | nan | 2157 | 2157 | nan |
| 7 | Advisory | Advisory | Advisory | 492 | 492 | nan | 645 | 645 | nan | nan | nan | nan | 1236 | 1236 | nan | nan | nan | nan | 1437 | 1437 | nan |
| 8 | Total investment banking fees | Total investment banking fees | Total investment banking fees | $ | 1513 | nan | $ | 1586 | nan | nan | nan | nan | $ | 3162 | nan | nan | nan | nan | $ | 3594 | nan |
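The repeated values look like what you get when a cell with a `colspan` attribute is copied into every column it spans. Below is a minimal, stdlib-only sketch of that kind of expansion. The markup is hypothetical (I have not reduced the actual filing to this), and the `ColspanTable` class is only an illustration, not pandas' own parser.

```python
from html.parser import HTMLParser


class ColspanTable(HTMLParser):
    """Collect table rows, repeating each cell's text once per spanned column."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._span = None  # colspan of the cell being read, None outside a cell
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._span = int(dict(attrs).get("colspan", 1))
            self._text = []

    def handle_data(self, data):
        if self._span is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._span is not None:
            text = "".join(self._text).strip()
            # A cell spanning N columns contributes N identical entries.
            self.rows[-1].extend([text] * self._span)
            self._span = None


# Hypothetical markup, assumed to resemble the filing's layout.
html = """
<table>
  <tr><td colspan="3">Equity</td><td>$</td><td>317</td><td></td></tr>
  <tr><td colspan="3">Debt</td><td colspan="2">704</td><td></td></tr>
</table>
"""

parser = ColspanTable()
parser.feed(html)
for row in parser.rows:
    print(row)
```

Under that assumption, the `Debt` row shows `704` twice because its value cell spans two columns, while the `Equity` row keeps `$` and `317` apart because they are separate cells.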
Expected Behavior
- There would be multi-indexed columns based on the table headers
- Only one of the columns `0`, `1`, and `2` would exist (and so on wherever it matters)
Just a thought: is there an out-of-the-box way to standardize these kinds of tables (something like `pd.melt`)?
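As a possible workaround, the duplicated columns could be collapsed after parsing. The sketch below assumes every logical column occupies a fixed run of three parsed columns and that a lone `$` is decoration rather than data. Both are assumptions taken from the output above, and `collapse_group` and `collapse_spans` are hypothetical helpers, not pandas functions.

```python
import pandas as pd

# Hypothetical miniature of the parsed frame: each logical cell fills three columns.
raw = pd.DataFrame(
    [
        ["(in millions)", "(in millions)", "(in millions)", "2023", "2023", "2023"],
        ["Equity", "Equity", "Equity", "$", "317", None],
        ["Debt", "Debt", "Debt", "704", "704", None],
    ]
)


def collapse_group(block):
    """Reduce one run of columns to a single Series."""
    # Blank out lone currency symbols, then keep the last remaining value per row.
    cleaned = block.mask(block.isin(["$"]))
    return cleaned.ffill(axis=1).iloc[:, -1]


def collapse_spans(df, width=3):
    """Merge every `width` adjacent columns into one column."""
    groups = [df.iloc[:, i:i + width] for i in range(0, df.shape[1], width)]
    return pd.concat([collapse_group(g) for g in groups], axis=1, ignore_index=True)


tidy = collapse_spans(raw, width=3)
print(tidy)
```

On the full table, runs that contain only `nan` would survive as empty columns, so a follow-up `tidy.dropna(axis=1, how="all")` would probably be needed as well.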
Installed Versions
Comment From: rhshadrach
Thanks for the report! Can you include the code to reproduce in the OP here directly (rather than a link), and try to make it as minimal as possible?
Have you seen the documentation on the table-parsing gotchas?
Comment From: mroeschke
Closing as this appeared to need more information to be actionable