Pandas ENH: Auto-detect text encoding to avoid UnicodeDecodeErrors

Feature Type

[X] Adding new functionality to pandas
[ ] Changing existing functionality in pandas
[ ] Removing existing functionality in pandas

Problem Description

Our proposal improves the robustness of pandas' text importers, in particular the read_csv() function. Currently, an encoding can be set explicitly; otherwise it defaults to None, which appears to resolve to 'utf-8', though this may be platform-specific. Unfortunately, CSV files often come in other encodings. For example, Excel does not use UTF-8 by default, and users often do not pay attention to encodings when saving, so we have to handle files in different encodings. pandas raises a UnicodeDecodeError whenever an encoding other than 'utf-8' is required, even though text editors typically detect the right encoding automatically.

Feature Description

Several resources suggest detecting the encoding automatically with chardet.detect(). In our experiments, the following code recognized the correct encoding ('utf-8' or 'ISO-8859-1'):

import io

import chardet
import pandas as pd

filename = 'path/to/some/file.csv'     # source file
encoding = None                        # set explicitly, or leave as None to auto-detect
with open(filename, 'rb') as file:
    data = file.read()                 # read raw bytes so they can be inspected before decoding
if encoding is None:                   # not given explicitly: let chardet guess the encoding
    encoding = chardet.detect(data)['encoding']
df = pd.read_csv(io.BytesIO(data), encoding=encoding)

This could be added to pandas directly as an additional encoding='auto' case, or even replace the current default in the None case. We do not know whether auto-detection fails in some cases, but in our experiments it did a much better job than the current default decoding. Therefore, we would like to propose this feature.
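To make the encoding='auto' idea more concrete, here is a minimal sketch of how such a value might be resolved. The helper name resolve_encoding and its sample_size and fallback parameters are hypothetical and not part of pandas. It assumes chardet is an optional dependency, and it checks that the guess can actually decode the data before trusting it, since detection on a sample can be wrong:

```python
try:
    import chardet  # optional dependency, as in the snippet above
except ImportError:
    chardet = None  # without chardet, only the UTF-8 check and the fallback apply


def resolve_encoding(data, encoding="auto", sample_size=64 * 1024, fallback="latin-1"):
    """Return a codec name for `data` (hypothetical helper, not a pandas API)."""
    if encoding != "auto":
        return encoding  # an explicitly given encoding always wins

    # Only a prefix is passed to the detector to keep large files cheap.
    sample = data[:sample_size]
    guess = chardet.detect(sample)["encoding"] if chardet is not None else None

    # Accept the first candidate that decodes the full input without error.
    for candidate in (guess, "utf-8"):
        if candidate is None:
            continue
        try:
            data.decode(candidate)
        except (LookupError, UnicodeDecodeError):
            continue
        return candidate

    # latin-1 maps every byte to a character, so decoding cannot fail,
    # although the resulting text may be wrong.
    return fallback
```

The result could then be passed on as in the original snippet, e.g. pd.read_csv(io.BytesIO(data), encoding=resolve_encoding(data)). Whether pandas would want the silent latin-1 fallback or an error instead is an open design question.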

Alternative Solutions

Alternatively, the right encoding must be specified explicitly to avoid UnicodeDecodeErrors.
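When the set of plausible encodings is known in advance, the explicit route can be semi-automated without any detector by simply trying candidates in order. This is trial decoding, not statistical detection as chardet does. The function name first_working_encoding and the default candidate list are assumptions chosen for illustration:

```python
def first_working_encoding(data, candidates=("utf-8-sig", "cp1252", "latin-1")):
    """Return the first codec in `candidates` that decodes `data` without error."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this codec rejects the bytes, try the next one
        return name
    raise ValueError("none of the candidate encodings could decode the data")
```

The order matters: strict codecs such as UTF-8 should come first, because latin-1 accepts any byte sequence and would otherwise always win. The returned name can be passed as the encoding argument of pd.read_csv().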

Additional Context

No response

Comment From: mraza007

This seems like a pretty useful enhancement. I have often encountered this, and the usual solution is to set the encoding when reading the file.

Comment From: ChillarAnand

Any plans on integrating this?

Comment From: mraza007

I can work on implementing this, if no one has worked on it so far.

Comment From: CSBVision

We think the required code is already there; the main questions are where it should be implemented and whether it should replace the current default None case. We think it should, but unfortunately no pandas dev has commented on this proposal so far. So maybe just create a PR from our code and see whether it gets merged? We are fine with that 👍

Comment From: mraza007

Sounds good to me and I'll look into this