Feature Type
- [X] Adding new functionality to pandas
- [ ] Changing existing functionality in pandas
- [ ] Removing existing functionality in pandas
Problem Description
Our proposal improves the robustness of pandas' text importers, in particular the read_csv() function. Currently, an explicit encoding can be set; otherwise it defaults to None, which seems to resolve to 'utf-8', though this may be platform-specific. Unfortunately, CSV files often come in other encodings. For example, Excel does not use UTF-8 by default, and users often do not pay attention to encodings when saving, so we have to handle files in different encodings. pandas raises a UnicodeDecodeError if something other than 'utf-8' is required, even though text editors automatically detect the right encoding.
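To make the failure concrete, here is a minimal sketch using only the Python standard library. The sample bytes and the helper name `decode_error` are made up for illustration; the point is only that text saved as ISO-8859-1 is not valid UTF-8 as soon as it contains a non-ASCII character.

```python
def decode_error(data, encoding):
    """Return the UnicodeDecodeError raised while decoding, or None on success."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc
    return None


# Hypothetical CSV content, saved as ISO-8859-1 instead of UTF-8.
raw = "name,city\nM\u00fcller,K\u00f6ln\n".encode("iso-8859-1")

print(decode_error(raw, "utf-8"))       # an error: 0xfc is not a valid UTF-8 start byte
print(decode_error(raw, "iso-8859-1"))  # None, decoding succeeds
```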
Feature Description
Several resources suggest automatically detecting the right encoding using chardet.detect(). With this approach, the following code recognized the right encoding in our experiments ('utf-8' or 'ISO-8859-1'):
import io

import chardet
import pandas as pd

filename = 'path/to/some/file.csv'  # source file
encoding = None  # encoding can be predefined or not

# Read the raw bytes so the encoding can be inspected before parsing
with open(filename, 'rb') as file:
    data = file.read()

# If not explicitly given, let chardet guess the encoding
if encoding is None:
    encoding = chardet.detect(data)['encoding']

df = pd.read_csv(io.BytesIO(data), encoding=encoding)
This could be offered as an additional encoding='auto' case, or even replace the current default in the None case, directly inside pandas. We do not know whether this automatic detection might fail in some cases, but in our experiments it did a much better job than the current default decoding. Therefore, we would like to propose this feature.
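One possible shape for the 'auto' case is sketched below. This is not pandas code: `resolve_encoding`, its `fallback` argument, and the injectable `detect` argument are hypothetical names chosen for illustration. The sketch also covers one case where detection does not help: chardet.detect() reports an encoding of None when it cannot make a guess (for example on empty input), so a fallback is needed.

```python
def resolve_encoding(data, encoding="auto", fallback="utf-8", detect=None):
    """Return the encoding to use for `data` (raw bytes).

    Hypothetical helper, not part of pandas. Only encoding="auto"
    triggers detection; any other value is returned unchanged.
    """
    if encoding != "auto":
        return encoding
    if detect is None:
        # Imported lazily so explicit encodings work without chardet installed.
        import chardet
        detect = chardet.detect
    guess = detect(data).get("encoding")
    # The detector may report None when it cannot make a guess.
    return guess if guess is not None else fallback
```

With the bytes from the snippet above, the result would be passed on as `pd.read_csv(io.BytesIO(data), encoding=resolve_encoding(data))`. Keeping the detector injectable is a design choice of this sketch, so that the detection library stays an optional dependency.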
Alternative Solutions
Alternatively, the right encoding has to be defined explicitly to avoid UnicodeDecodeErrors.
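When the set of likely encodings is known in advance, the explicit approach can be turned into a small fallback chain. The sketch below does not use chardet; it simply tries a fixed list of encodings in order. The function name `decode_with_fallback` and the default candidate list are assumptions for illustration. Note that ISO-8859-1 accepts every byte sequence, so it never raises and must come last; it can still produce wrong characters if the file is actually in some other encoding.

```python
def decode_with_fallback(data, encodings=("utf-8", "iso-8859-1")):
    """Decode `data` with the first encoding that works.

    Returns a (text, encoding_name) tuple. Hypothetical helper for illustration.
    """
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError(f"none of the encodings {encodings!r} could decode the data")


# Hypothetical sample saved as ISO-8859-1: UTF-8 fails, the fallback succeeds.
text, used = decode_with_fallback("name\nM\u00fcller\n".encode("iso-8859-1"))
print(used)
```

The decoded text can then be handed to pandas through `io.StringIO(text)`, which needs no encoding argument.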
Additional Context
No response
Comment From: mraza007
This seems like a pretty useful enhancement. I have often encountered this, and the usual solution is to set the encoding when reading the file.
Comment From: ChillarAnand
Any plans on integrating this?
Comment From: mraza007
I can work on implementing this if no one has worked on it so far.
Comment From: CSBVision
We think the required code is already there; the main question is where it should be implemented. And should it replace the current default None case? We think it should, but unfortunately no pandas dev has commented on this proposal so far. So maybe just create a PR from our code and see whether it gets merged? We are fine with that 👍
Comment From: mraza007
Sounds good to me and I'll look into this