Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [x] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd

df = pd.DataFrame({'doubleByteCol': ['§' * 1500]})
df.to_stata('temp.dta', version=118)  # raises UnicodeEncodeError

# _encode_strings() measures the UTF-8 byte length (3000), assumes the column will become strL, and skips byte encoding (stata.py:2694)
len_encoded = df['doubleByteCol'].str.encode('utf-8').str.len()
# _dtype_to_stata_type() measures the character length (1500), so the column gets typ 1500 (stata.py:2193)
len_typlist = df['doubleByteCol'].str.len()
# True: the writer tries to convert to np dtype S1500, which fails on the still-unencoded unicode characters (normally they are byte-encoded first) (stata.py:2945,2956)
len_typlist < 2045
Issue Description
The StataWriter uses two different versions of the string column to check the same thing. During _encode_strings() it checks the length of the byte-encoded column, max_len_string_array(ensure_object(encoded._values)), but when assigning numpy types it checks the (potentially) unencoded version, itemsize = max_len_string_array(ensure_object(column._values)). This then trips up the _prepare_data() section, which expects short columns (typ <= self._max_string_length, based on the reported type) to be byte-encoded already. That expectation does not hold when the encoded column is longer than 2045 bytes because it contains unicode characters such as §, which take up two bytes each in UTF-8.
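The mismatch can be shown without going through pandas internals. The following is a minimal sketch using plain NumPy; the helper name fits_fixed_width_bytes and the constant MAX_STR_LEN are hypothetical and only mirror the 2045 limit quoted above.

```python
import numpy as np

MAX_STR_LEN = 2045  # fixed-width string limit quoted in this issue


def fits_fixed_width_bytes(value: str, width: int) -> bool:
    """Hypothetical helper: can NumPy store `value` in an S<width> array as-is?"""
    try:
        np.array([value], dtype=f"S{width}")
    except UnicodeEncodeError:
        # NumPy converts str to bytes with the ASCII codec, so non-ASCII text fails.
        return False
    return True


value = "§" * 1500
char_len = len(value)                  # length seen when the type is assigned
byte_len = len(value.encode("utf-8"))  # length seen by the encoding check

# The two measurements fall on opposite sides of the limit.
print(char_len, char_len <= MAX_STR_LEN)
print(byte_len, byte_len <= MAX_STR_LEN)

# The unencoded value cannot be placed in a fixed-width bytes array,
# while an ASCII-only value of the same character length can.
print(fits_fixed_width_bytes(value, char_len))
print(fits_fixed_width_bytes("a" * 1500, 1500))
```

Any column whose character length is at most 2045 but whose UTF-8 byte length exceeds it should hit the same path.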
Expected Behavior
I don't know the internal workings of stata.py well enough to be sure, but I think the easiest fix is to measure the original (unencoded) values when checking the string length in _encode_strings(). That is, replace
max_len_string_array(ensure_object(encoded._values))
with
max_len_string_array(ensure_object(self.data[col]._values))
Installed Versions
Temporary fix
For users finding this topic: the bug surfaces as the following exception.
Exception has occurred: UnicodeEncodeError (note: full exception trace is shown but execution is paused at: <module>)
'ascii' codec can't encode characters in position 0-1499: ordinal not in range(128)
File "F:\datatog\junkyard\adhoc-scripts\mwes\pandas_asciiencoding.py", line 4, in <module> (Current frame)
df.to_stata('temp.dta', version=118)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1499: ordinal not in range(128)
You can work around this issue by explicitly listing the offending columns in the convert_strl option of to_stata.
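A sketch of that workaround, for anyone who wants to detect the affected columns instead of listing them by hand. The helper columns_needing_strl and the constant MAX_STR_LEN are hypothetical names introduced here; only the convert_strl keyword of DataFrame.to_stata is actual pandas API (it requires version 117 or later).

```python
import os
import tempfile

import pandas as pd

MAX_STR_LEN = 2045  # fixed-width string limit quoted in this issue


def columns_needing_strl(df: pd.DataFrame) -> list:
    """Hypothetical helper: string columns whose UTF-8 encoding exceeds MAX_STR_LEN bytes."""
    cols = []
    for name in df.columns:
        col = df[name].dropna()
        if col.empty or not pd.api.types.is_string_dtype(col):
            continue
        # Measure the encoded length, since that is what ends up in the file.
        byte_len = col.str.encode("utf-8").str.len().max()
        if byte_len > MAX_STR_LEN:
            cols.append(name)
    return cols


if __name__ == "__main__":
    df = pd.DataFrame({"doubleByteCol": ["§" * 1500], "n": [1]})
    path = os.path.join(tempfile.mkdtemp(), "temp.dta")
    # Force the offending columns to strL so the fixed-width path is skipped.
    df.to_stata(path, version=118, convert_strl=columns_needing_strl(df))
```

The helper also returns columns that are long enough to become strL anyway; listing those in convert_strl should be harmless, since they would be written as strL regardless.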
Comment From: eicchen
take