Pandas version checks

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • [x] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame({'doubleByteCol': ['§'*1500]})
df.to_stata('temp.dta', version=118)

len_encoded = df['doubleByteCol'].str.encode('utf-8').str.len()     # 3000 -> _encode_strings() skips byte encoding, assuming the column will become strL (stata.py:2694)
len_typlist = df['doubleByteCol'].str.len()                         # 1500 -> _dtype_to_stata_type() assigns type 1500 (stata.py:2193)
len_typlist < 2045      # True -> _prepare_data() tries to convert to numpy dtype S1500, which fails because unicode characters are not supported (normally no issue because the column is encoded to bytes first) (stata.py:2945, 2956)

Issue Description

The StataWriter uses two different versions of the string column for what is effectively the same length check. In _encode_strings() it measures the byte-encoded column, max_len_string_array(ensure_object(encoded._values)), but when assigning numpy types it measures the (potentially) unencoded version, itemsize = max_len_string_array(ensure_object(column._values)). This then trips up _prepare_data(), which, based on the reported type (typ <= self._max_string_length), expects short columns to already be byte-encoded. That assumption breaks when the encoded column exceeds 2045 bytes because unicode characters such as § take up two bytes each.
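The mismatch can be seen without touching pandas internals: the character count (used to pick the Stata type) stays under 2045, while the UTF-8 byte count (used to decide whether to byte-encode) does not. A minimal illustration:

s = '§' * 1500
print(len(s))                  # 1500 characters -> fits a fixed-width str type (< 2045)
print(len(s.encode('utf-8')))  # 3000 bytes -> looks too long to byte-encode, so the column is left as unicode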

Expected Behavior

I don't know the internal workings of stata.py well enough to be sure, but I think the easiest fix is to use the original (unencoded) values when checking string length in _encode_strings(). That is, replace max_len_string_array(ensure_object(encoded._values)) with max_len_string_array(ensure_object(self.data[col]._values)).
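For illustration, a hedged sketch of what that change would look like inside the per-column loop of _encode_strings(); the surrounding lines are paraphrased from pandas 2.2.x and may not match the source exactly:

encoded = self.data[col].str.encode(self._encoding)
# Current check: measures the byte-encoded values, so a column of two-byte characters
# can exceed self._max_string_length here and be left unencoded (assumed to become strL):
#     if max_len_string_array(ensure_object(encoded._values)) <= self._max_string_length:
# Proposed check: measure the original string values instead, matching what
# _dtype_to_stata_type() later uses to assign the column type:
if max_len_string_array(ensure_object(self.data[col]._values)) <= self._max_string_length:
    self.data[col] = encoded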

Installed Versions

INSTALLED VERSIONS
------------------
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.12.1.final.0
python-bits : 64
OS : Windows
OS-release : 11
Version : 10.0.26100
machine : AMD64
processor : AMD64 Family 25 Model 33 Stepping 0, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : English_Belgium.1252
pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : None
pip : 23.2.1
Cython : None
pytest : 8.3.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.4
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql : None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 15.0.2
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

Temporary fix

For users finding this issue via search, it refers to the following exception:

Exception has occurred: UnicodeEncodeError
'ascii' codec can't encode characters in position 0-1499: ordinal not in range(128)
  File "F:\datatog\junkyard\adhoc-scripts\mwes\pandas_asciiencoding.py", line 4, in <module> (Current frame)
    df.to_stata('temp.dta', version=118)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1499: ordinal not in range(128)

You can work around this issue by explicitly listing the offending columns in the convert_strl option of to_stata.
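Using the column from the reproducible example above (convert_strl is a documented to_stata option for version 117+ files):

import pandas as pd

df = pd.DataFrame({'doubleByteCol': ['§' * 1500]})
# Writing the long unicode column as strL sidesteps the failing fixed-width str path
df.to_stata('temp.dta', version=118, convert_strl=['doubleByteCol'])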

Comment From: eicchen

take