Feature Type
- [ ] Adding new functionality to pandas
- [X] Changing existing functionality in pandas
- [ ] Removing existing functionality in pandas
Problem Description
I imagine this must have been raised before, but I can't find references. Apologies if it's a duplicate or a better fit for upstream projects.
When parsing a CSV file, pandas may fail to convert data to (explicit or inferred) types, e.g., because the data are indeed malformed. Pandas informs the user of the failure, and depending on the exact nature also mentions the value it failed to convert.
However, as a user I would like to know:
- Which value failed to convert (and ideally, which type it had prior to conversion),
- which column (ideally by name and index) the value appeared in, and
- which row the value appeared in (by index).
Currently, it appears there are only workarounds, as described, e.g., here: https://stackoverflow.com/a/65036765
I imagine this scenario may also appear in other settings, e.g., with other parsers.
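For reference, a workaround in the spirit of the linked answer is to read the affected columns as strings first and then locate the offending cells with `pandas.to_numeric(errors="coerce")`. This is a minimal sketch (not part of the proposal; the CSV content here is made up for illustration):

```python
import io

import pandas

# CSV with a malformed value in column "A", row 2
csv = "A,B\n1,0.5\n2,0.25\n4 foo,0.75\n"

# Read everything as strings so parsing cannot fail yet.
df = pandas.read_csv(io.StringIO(csv), dtype=str)

# Coerce column "A" to numeric; unparseable cells become NaN.
coerced = pandas.to_numeric(df["A"], errors="coerce")

# Row index and original value of each offending cell.
bad = df.loc[coerced.isna(), "A"]
print(bad)
```

This recovers the row index and the bad value after the fact, but it requires a second parsing pass and knowing in advance which columns to check.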
Example scenario:

```python
import os.path
import random
import tempfile

import pandas

random.seed(0)
with tempfile.TemporaryDirectory() as d:
    filepath = os.path.join(d, "my.csv")
    df = pandas.DataFrame(
        [{"A": random.randint(0, 10), "B": str(random.random())} for _ in range(10)]
    )
    df.at[2, "A"] = "4 foo"
    df.to_csv(filepath)
    pandas.read_csv(filepath, dtype={"A": int, "B": float}, engine="c")
```
Here, the output is

```
Traceback (most recent call last):
  File "parsers.pyx", line 1160, in pandas._libs.parsers.TextReader._convert_tokens
TypeError: Cannot cast array data from dtype('O') to dtype('int64') according to the rule 'safe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/workspaces/pandas/tc1.py", line 18, in <module>
    df = pandas.read_csv(filepath, dtype={"A": int, "B": float}, engine=engine)
  File "/workspaces/pandas/pandas/io/parsers/readers.py", line 945, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/workspaces/pandas/pandas/io/parsers/readers.py", line 614, in _read
    return parser.read(nrows)
  File "/workspaces/pandas/pandas/io/parsers/readers.py", line 1744, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/workspaces/pandas/pandas/io/parsers/c_parser_wrapper.py", line 233, in read
    chunks = self._reader.read_low_memory(nrows)
  File "parsers.pyx", line 843, in pandas._libs.parsers.TextReader.read_low_memory
  File "parsers.pyx", line 920, in pandas._libs.parsers.TextReader._read_rows
  File "parsers.pyx", line 1065, in pandas._libs.parsers.TextReader._convert_column_data
  File "parsers.pyx", line 1166, in pandas._libs.parsers.TextReader._convert_tokens
ValueError: invalid literal for int() with base 10: '4 foo'
```
which does mention source and target types as well as the value, but gives no indication as to where the value appeared. This makes debugging harder if, for example, one does not have the CSV available for ex-post analysis.
Using the `python` engine instead of `c`, the error message also names the column, but not the row:
```
Traceback (most recent call last):
  File "/workspaces/pandas/pandas/io/parsers/base_parser.py", line 837, in _cast_types
    values = astype_array(values, cast_type, copy=True)
  File "/workspaces/pandas/pandas/core/dtypes/astype.py", line 183, in astype_array
    values = _astype_nansafe(values, dtype, copy=copy)
  File "/workspaces/pandas/pandas/core/dtypes/astype.py", line 134, in _astype_nansafe
    return arr.astype(dtype, copy=True)
ValueError: invalid literal for int() with base 10: '4 foo'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/workspaces/pandas/tc1.py", line 18, in <module>
    df = pandas.read_csv(filepath, dtype={"A": int, "B": float}, engine=engine)
  File "/workspaces/pandas/pandas/io/parsers/readers.py", line 945, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/workspaces/pandas/pandas/io/parsers/readers.py", line 614, in _read
    return parser.read(nrows)
  File "/workspaces/pandas/pandas/io/parsers/readers.py", line 1744, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/workspaces/pandas/pandas/io/parsers/python_parser.py", line 288, in read
    conv_data = self._convert_data(data)
  File "/workspaces/pandas/pandas/io/parsers/python_parser.py", line 359, in _convert_data
    return self._convert_to_ndarrays(
  File "/workspaces/pandas/pandas/io/parsers/base_parser.py", line 592, in _convert_to_ndarrays
    cvals = self._cast_types(cvals, cast_type, c)
  File "/workspaces/pandas/pandas/io/parsers/base_parser.py", line 839, in _cast_types
    raise ValueError(
ValueError: Unable to convert column A to type int64
```
Feature Description
I am new to the pandas codebase, and multiple code paths (`c` vs. `python` engines, `low_memory`, etc.) and components (pandas, numpy, cpython) seem to be involved, so I am seeking guidance on what the right course of action is.
Maybe one could raise a warning, e.g., in `_try_int64_nogil` and related functions in `pandas/_libs/parsers.pyx`, on the `if error != 0` paths? One would likely need to make the warning configurable, e.g., to suppress it when types are merely being inferred.
Alternative Solutions
Might it be possible to change numpy to include the requested information in `ndarray.astype` or similar?
Additional Context
No response
Comment From: lithomas1
Not sure if this is possible.
At least in the Python engine, at the point that conversion is done, the data is already read into a numpy array that is then processed. Mapping element indices back to the original line is challenging, especially if things such as skipped lines are involved.
This is also likely to have performance (both speed and memory) implications, I think.
When I get an error like this, having the failed value is usually enough, since I can just Ctrl+F (or grep) for it. Does this work for you?
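(The grep approach, for anyone landing here: while the file is still available, the value quoted in the error message can be searched for to recover the line number. A minimal sketch, with a made-up file for illustration:)

```python
import os.path
import tempfile

needle = "4 foo"  # the value reported in the error message

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "my.csv")
    with open(path, "w") as f:
        f.write("A,B\n1,0.5\n4 foo,0.25\n2,0.75\n")

    # Equivalent of `grep -n '4 foo' my.csv`: 1-based line numbers of matches.
    with open(path) as f:
        hits = [lineno for lineno, line in enumerate(f, start=1) if needle in line]

print(hits)
```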
Comment From: DavidToneian
I understand the data has already been read in at the point the exception is raised; to give better error messages, one would still need to keep track of metadata, i.e., which chunk of data originates from where (similar to what the `python` engine seems to be doing). Granted, this would have performance impacts, as you say, though I would guess they wouldn't be significant.
Regarding Ctrl+F or grepping: I have often encountered this issue with data that are processed in production systems I don't have access to, and that don't allow (for regulatory, security, and practical reasons) to log the files that are fed into the CSV parser. Troubleshooting these kinds of issues would be much easier if one knew where the bad data are present.
I'm happy to try and work on a patch; I just wanted to check beforehand what the constraints are for a solution to be acceptable.
Comment From: lithomas1
> Regarding Ctrl+F or grepping: I have often encountered this issue with data that are processed in production systems I don't have access to, and that don't allow (for regulatory, security, and practical reasons) to log the files that are fed into the CSV parser. Troubleshooting these kinds of issues would be much easier if one knew where the bad data are present.
I'm not sure I follow here. How does having the line number/column number help if you don't have access to the file afterwards?
You can try working on a patch if you want. If it doesn't add a huge amount of complexity, I guess we could take it. (This'll probably have to be gated behind a keyword, though. We can work out the name of the keyword later.)
I'd recommend getting started with the Python parser (in `pandas/io/parsers/python_parser.py`).
Comment From: DavidToneian
take