Feature Type

  • [ ] Adding new functionality to pandas

  • [X] Changing existing functionality in pandas

  • [ ] Removing existing functionality in pandas

Problem Description

As a user, I am reading in CSV files with mixed types in certain columns. Specifically, a column can contain either a value or a file path to a list of values. With all current quoting flags, `pd.read_csv` turns this into an object column in which the floats and ints are quoted as strings as well.

Thus, turning all numbers into numerics while keeping all file paths as strings is easily done as follows: `df[column_name] = df[column_name].apply(_pd.to_numeric, errors="ignore")`

As the solution above is being deprecated, I wish pandas could provide a simple way to get object columns with mixed dtypes from CSV files that treat numbers as numerics.
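For illustration, the effect the deprecated `errors="ignore"` path achieved can be reproduced today with a vectorized two-step sketch (the column contents below are made up, not from an actual file):

```python
import pandas as pd

# Hypothetical mixed column as read_csv delivers it: every entry is a string.
s = pd.Series(["1.5", "data/values.csv", "42", "other/path.csv"])

# Step 1: parse what parses; entries that fail become NaN instead of raising.
numeric = pd.to_numeric(s, errors="coerce")

# Step 2: fall back to the original string wherever parsing failed.
mixed = numeric.astype(object).where(numeric.notna(), s)
# mixed is an object column holding 1.5, "data/values.csv", 42.0, "other/path.csv"
```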

Feature Description

usage: `pd.to_numeric(preserve_text=True)`

implementation

Alternative Solutions

usage: `pd.read_csv(to_numeric_in_mixed_columns=True, preserve_text_in_mixed_columns=True)`

Additional Context

https://github.com/pandas-dev/pandas/issues/54467
https://github.com/pandas-dev/pandas/issues/43280
https://github.com/pypest/pyemu/pull/485

The proposed workaround in the pypest issue is not satisfactory as this should be a common occurrence with a common solution.

Comment From: mepearson

upvote for this issue

Comment From: snitish

@mroeschke this looks like a legitimate use case of the errors="ignore" feature of to_numeric that was deprecated in v2.2. Any ideas on how we can add this functionality back, either within to_numeric or as a new function? So far the only acceptable workaround seems to be this, but it's not particularly clean.

Comment From: ahobeost

Thanks @snitish. Two suggestions:

  • The user could provide allowed types to `to_numeric`: `_pd.to_numeric(..., allowed_types=[float, datetime])`

  • Another function could be built which inherently deals with mixed columns:

```
_pd.to_mixed(..., expected_types=[int, float, datetime])
_pd.interpret_as_mixed_column()
_pd.allow_expected_types()
```

Comment From: ahobeost

Based on some initial tests with df.convert_dtypes, the two suggested options will likely not work.

Thus I add a third suggestion: `_pd.to_numeric(..., keep_original_strings=True)`

If an error occurs, it keeps the original string for that entry.

Comment From: mepearson

> Based on some initial tests with df.convert_dtypes, the two suggested options will likely not work.
>
> Thus I add a third suggestion: `_pd.to_numeric(..., keep_original_strings=True)`
>
> If an error occurs, it keeps the original string for that entry.

This would work for us and is the behavior we are using in our use case.

For reference, the use case is a .csv that serves as a controlled vocabulary for defined terms, in order to map between fields, values, and UI display terms. It needs to be centralized in a single file because the actual values/UI meanings are reviewed and maintained by SMEs rather than developers.

The code in question creates a dict that splits this apart, and could potentially be handled with an if/then, since within a given field the values are either all numeric or all not. But "if this is numeric, make it numeric, and if not, leave it as a string" is the intended behavior as currently written.

Current function in use:

```python
import traceback

import pandas as pd


def get_terms_dictionary(display_terms, api_field, api_value, display_col):
    '''From a dataframe with the table display information, create a dictionary
    by field to match the database value to a value for use in the UI.'''
    try:
        display_terms_list = display_terms[api_field].unique()  # List of fields with matching display terms

        # Create a dictionary using the field as the key, and the dataframe mapping
        # database values to display text as the value
        display_terms_dict = {}
        for i in display_terms_list:
            term_df = display_terms[display_terms.api_field == i]
            term_df = term_df[[api_value, display_col]]
            term_df = term_df.rename(columns={api_value: i, display_col: i + '_display'})
            term_df = term_df.apply(pd.to_numeric, errors='ignore')  # deprecated in pandas 2.2
            display_terms_dict[i] = term_df
        return display_terms_dict

    except Exception:
        traceback.print_exc()
        return None
```
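Since within a given field the values are either all numeric or all not, the if/then alternative mentioned above could be sketched as a per-column helper (the helper name is hypothetical, not pandas API):

```python
import pandas as pd

def coerce_if_all_numeric(s):
    # Hypothetical helper: convert a column only when every non-missing
    # entry parses as a number; otherwise return it untouched.
    converted = pd.to_numeric(s, errors="coerce")
    if converted.notna().equals(s.notna()):
        return converted
    return s

# Fields that are all numeric convert; mixed/text fields stay as strings.
coerce_if_all_numeric(pd.Series(["1", "2"]))         # becomes an int64 column
coerce_if_all_numeric(pd.Series(["1", "path.csv"]))  # unchanged object column
```

Applied with `df.apply(coerce_if_all_numeric)`, this gives the field-level behavior without per-element conversion.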

Comment From: ahobeost

Based on the comment by @mepearson, I add our example here.
In [optihood](https://github.com/SPF-OST/optihood), the user provides a lot of information about an energy network for a neighborhood of buildings.
As part of this information, some columns contain either constant values or string paths to a CSV file.
Thus, a number should be converted to an int or float, while a string path should remain a string.

![Image](https://github.com/user-attachments/assets/3f580c25-fe7f-4afd-89cc-cd21f3061c07)
https://github.com/SPF-OST/optihood/blob/main/data/CSVs/basic_example_CSVs/commodity_sources.csv

To complicate things, the cells can also be empty.

![Image](https://github.com/user-attachments/assets/5b017ef5-96ba-4310-931b-ca55b3e3ace6)
https://github.com/SPF-OST/optihood/blob/main/data/CSVs/basic_example_CSVs/buses.csv

It would be nice if these empty cells remain interpreted as `NaN`.
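A quick check with a made-up miniature of such a file suggests a coerce-then-restore pattern does keep empty cells as `NaN`, since `read_csv` already parses them as missing (the column names and values here are illustrative, not from the optihood files):

```python
import io

import pandas as pd

# Assumed miniature of the input: a number, a path, and an empty cell.
csv = "source,efficiency\nA,0.95\nB,profiles/eff.csv\nC,\n"
df = pd.read_csv(io.StringIO(csv))

col = df["efficiency"]  # object column: the number arrives quoted as a string
numeric = pd.to_numeric(col, errors="coerce")
# Restore the original value only where it was a real (non-NaN) string.
mixed = numeric.astype(object).where(numeric.notna() | col.isna(), col)
# mixed holds 0.95, "profiles/eff.csv", NaN - the empty cell stays missing
```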

Comment From: mroeschke

Thanks for the suggestion, but I would be -1 on modifying `to_numeric` in this way.

Although it's not "clean", you can just do

```python
def parse_numbers(x):
    try:
        return pd.to_numeric(x)
    except Exception:
        return x

df[col].apply(parse_numbers)
```

This behavior was intentionally deprecated in pandas 2.x to avoid potentially returning object columns, which don't have great support and have poor performance characteristics.