Feature Type
- [ ] Adding new functionality to pandas
- [X] Changing existing functionality in pandas
- [ ] Removing existing functionality in pandas
Problem Description
As a user, I am reading CSV files with mixed types in certain columns: a cell can hold either a numeric value or a file path to a list of values. With all current quoting flags, `pd.read_csv` turns such a column into an object column in which the floats and ints are quoted as strings as well.
Turning all numbers into numerics while keeping all file paths as strings used to be easily done as follows:
```python
df[column_name] = df[column_name].apply(_pd.to_numeric, errors="ignore")
```
As the solution above is being sunsetted, I wish pandas could provide a simple way to get object columns with mixed dtypes from CSV files that treat numbers as numerics.
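As a minimal sketch of the problem (the column names and values here are invented for illustration):

```python
import io
import pandas as pd

# Hypothetical CSV in which one column mixes numbers and file paths,
# mirroring the situation described above.
csv = io.StringIO("source,value\na,1.5\nb,data/series.csv\nc,42\n")
df = pd.read_csv(csv)

# The mixed column comes back as object dtype with every entry a string,
# including the numbers.

# Per-element conversion keeps the paths while turning the numbers numeric,
# emulating the deprecated errors="ignore" behaviour.
def best_effort_numeric(x):
    try:
        return pd.to_numeric(x)
    except (ValueError, TypeError):
        return x

df["value"] = df["value"].apply(best_effort_numeric)
```

After the `apply`, the column holds `1.5` and `42` as numbers and the path as a string, but the column dtype is still object.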
Feature Description
usage: `pd.to_numeric(preserve_text=True)`
Alternative Solutions
usage: `pd.read_csv(to_numeric_in_mixed_columns=True, preserve_text_in_mixed_columns=True)`
Additional Context
https://github.com/pandas-dev/pandas/issues/54467 https://github.com/pandas-dev/pandas/issues/43280 https://github.com/pypest/pyemu/pull/485
The workaround proposed in the pypest PR is not satisfactory, as this should be a common occurrence with a common solution.
**Comment From: mepearson**
upvote for this issue
**Comment From: snitish**
@mroeschke this looks like a legitimate use case of the `errors="ignore"` feature of `to_numeric` that was deprecated in v2.2. Any ideas on how we can add this functionality back, either within `to_numeric` or as a new function? So far the only acceptable workaround seems to be this, but it's not particularly clean.
**Comment From: ahobeost**
Thanks @snitish.
Two suggestions:
- The user could provide allowed types to `to_numeric`:
  `_pd.to_numeric(..., allowed_types=[float, datetime])`
- Another function could be built which inherently deals with mixed columns:
  ```
  _pd.to_mixed(..., expected_types=[int, float, datetime])
  _pd.interpret_as_mixed_column()
  _pd.allow_expected_types()
  ```
**Comment From: ahobeost**
Based on some initial tests with `df.convert_dtypes`, the two suggested options will likely not work.
Thus I add a third suggestion:
`_pd.to_numeric(..., keep_original_strings=True)`
If an error occurs, it keeps the original string for that entry.
**Comment From: mepearson**
> Based on some initial tests with `df.convert_dtypes`, the two suggested options will likely not work. Thus I add a third suggestion:
> `_pd.to_numeric(..., keep_original_strings=True)`
> If an error occurs, it keeps the original string for that entry.
This would work for us and is the behavior we are using in our use case.
For reference, the use case is a .csv that serves as a controlled vocabulary for defined terms, in order to map between fields, values, and UI display terms. It needs to be centralized in a single file because the actual values/UI meanings are reviewed and maintained by SMEs rather than developers.
The code in question creates a dict that splits this apart, and could potentially be handled with an if/then, since within a given field the values are either all numeric or all not. But "if this is numeric, make it numeric, and if not, leave it as a string" is the intended behavior as currently written.
Current function in use:
```python
def get_terms_dictionary(display_terms, api_field, api_value, display_col):
    '''From a dataframe with the table display information, create a dictionary
    by field to match the database value to a value for use in the UI.'''
    try:
        display_terms_list = display_terms[api_field].unique()  # List of fields with matching display terms
        # Create a dictionary using the field as the key, and the dataframe to map
        # database values to display text as the value
        display_terms_dict = {}
        for i in display_terms_list:
            term_df = display_terms[display_terms.api_field == i]
            term_df = term_df[[api_value, display_col]]
            term_df = term_df.rename(columns={api_value: i, display_col: i + '_display'})
            term_df = term_df.apply(pd.to_numeric, errors='ignore')
            display_terms_dict[i] = term_df
        return display_terms_dict
    except Exception as e:
        traceback.print_exc()
        return None
```
**Comment From: ahobeost**
Based on the comment by @mepearson, I add our example here.
In [optihood](https://github.com/SPF-OST/optihood), the user provides a lot of information related to an energy network for a neighborhood of buildings.
As part of this information, some columns contain either constant values or string paths to a CSV file.
The numbers should therefore be converted to int or float, while the strings should remain.

https://github.com/SPF-OST/optihood/blob/main/data/CSVs/basic_example_CSVs/commodity_sources.csv
To complicate things, the cells can also be empty.

https://github.com/SPF-OST/optihood/blob/main/data/CSVs/basic_example_CSVs/buses.csv
It would be nice if these empty cells remain interpreted as `NaN`.
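A sketch of that combined behaviour under today's API (the column names and values here are invented, loosely modelled on the buses.csv case): convert where possible, keep strings, and pass NaN through untouched.

```python
import io
import pandas as pd

# Hypothetical two-column CSV with an empty cell in column "a".
csv = io.StringIO("a,b\n0.075,x\n,y\npath/to/demand.csv,z\n")
df = pd.read_csv(csv)

def to_numeric_keep_text(x):
    """Convert to a number where possible, keep strings, leave NaN as NaN."""
    if pd.isna(x):
        return x
    try:
        return pd.to_numeric(x)
    except (ValueError, TypeError):
        return x

out = df["a"].apply(to_numeric_keep_text)
```

The empty cell is read as `NaN` by `pd.read_csv` and stays `NaN` in the result, while `0.075` becomes a float and the path stays a string.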
**Comment From: mroeschke**
Thanks for the suggestion, but I would be -1 on modifying `to_numeric` in this way.
Although it's not "clean", you can just do
```python
def parse_numbers(x):
    try:
        return pd.to_numeric(x)
    except Exception:
        return x

df[col].apply(parse_numbers)
```
This behavior was intentionally deprecated in pandas 2.x to avoid potentially returning object columns, which have poor support and poor performance characteristics.
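A vectorized variant of the same idea (an assumption, not an official pandas recipe): coerce everything first, then restore the original entries wherever coercion produced `NaN`. Note that this goes through a float64 intermediate, so integers come back as floats.

```python
import pandas as pd

# Hypothetical mixed column of numbers-as-strings and file paths.
s = pd.Series(["1.5", "data/series.csv", "42"])

# errors="coerce" turns unparseable entries into NaN ...
coerced = pd.to_numeric(s, errors="coerce")

# ... and .where() restores the original values at those positions.
mixed = coerced.where(coerced.notna(), s)
```

A genuine `NaN` in the input survives unchanged, since `to_numeric` maps it to `NaN` and `.where()` then falls back to the original `NaN`.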