Looking at the keywords for convert_dtypes I'm wondering if users actually want it for anything other than dtype_backend?

Comment From: Aniketsy

I would like to work on this issue. Could you please provide more detailed instructions or clarify the specific changes you're looking for? Thank you!

Comment From: rhshadrach

It seems to me there are two situations that a user may face:

  1. "I want to find the best dtype to hold my data". For this, I think the keywords e.g. convert_integer makes sense.
  2. "I want to to take my data and convert all the dtypes to the corresponding pyarrow dtype ". For this I would not have the keywords.

Perhaps these should be separate functions?

I personally think (1) is of questionable use: it's value-specific behavior, and users would get better behavior by converting the data themselves. Still, in ad-hoc analysis type situations I can see the convenience and am okay with keeping it. (2) on the other hand seems highly desirable.
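
For concreteness, a minimal sketch of the two situations against the current API (made-up data; the exact dtype reprs may differ between pandas versions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "ints": np.array([1, 2, 3], dtype="int64"),
        "floats": [1.5, np.nan, 3.0],
        "strs": ["x", "y", None],
    }
)

# Situation (1): pick the "best" nullable dtype per column, opting in
# per type via the convert_* keywords (here: integers yes, strings no).
df.convert_dtypes(convert_integer=True, convert_string=False).dtypes
# ints       Int64
# floats    Float64
# strs       object

# Situation (2): convert every column to the corresponding pyarrow dtype.
df.convert_dtypes(dtype_backend="pyarrow").dtypes
# ints       int64[pyarrow]
# floats    double[pyarrow]
# strs      string[pyarrow]
```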

Comment From: jorisvandenbossche

FWIW, this method was originally added to convert to nullable dtypes, not specifically to arrow-backed dtypes (that is also still the default behaviour).

And I think one of the reasons we added those keywords initially is that people might, for example, only want to use the nullable integer dtype (because that adds more value) and not necessarily the nullable float or string. At the time those dtypes were introduced (experimentally), I think those keywords made sense. But that is less the case right now (and indeed even less so if you specify dtype_backend).

  1. "I want to find the best dtype to hold my data". For this, I think the keywords e.g. convert_integer makes sense.

I don't think this function actually does that, except for going from object dtype to a better dtype. For that we already have a dedicated method, df.infer_objects(); convert_dtypes() was added on top of that to convert to nullable dtypes.

But for the rest, the function only converts from non-nullable to nullable dtypes; it won't actually "optimize" your data types (for example, it won't try to downcast to a smaller bitsize where possible, as pd.to_numeric can). There is of course the specific case of casting rounded floats to integer, but that is tied to the aspect of converting to nullable dtypes.
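
A tiny illustration of that distinction, assuming current behaviour (made-up series):

```python
import pandas as pd

s = pd.Series([1, 2, 3], dtype="int64")

# convert_dtypes() only swaps in the nullable counterpart of the
# existing dtype; it does not downcast to a smaller bitsize.
s.convert_dtypes().dtype                    # Int64, not Int8

# Downcasting is what pd.to_numeric offers instead.
pd.to_numeric(s, downcast="integer").dtype  # int8

# The actual "inference" only happens for object dtype, which
# infer_objects() already covers; convert_dtypes() infers and then
# converts to the nullable dtype on top of that.
obj = pd.Series([1, 2, 3], dtype="object")
obj.infer_objects().dtype                   # int64
obj.convert_dtypes().dtype                  # Int64
```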

Comment From: rhshadrach

Thanks @jorisvandenbossche - makes sense. I think I got these mixed up, especially with the convert_dtypes docstring being:

Convert columns to the best possible dtypes using dtypes supporting pd.NA.

I'm definitely good with deprecating the convert_* arguments here. However, the infer_objects=True option seems like it would be better handled by DataFrame.infer_objects itself, if we were to add the possibility of converting to NumPy-nullable / PyArrow dtypes directly there.
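
Roughly something along these lines; the first line works today, while a dtype_backend keyword on infer_objects is purely hypothetical (it does not exist):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]}, dtype="object")

# Works today: infer object columns, then convert to pyarrow dtypes.
df.infer_objects().convert_dtypes(dtype_backend="pyarrow")

# Hypothetical: let infer_objects target a backend directly, which would
# make the infer_objects=True keyword on convert_dtypes redundant.
# (dtype_backend is NOT an existing keyword of infer_objects.)
# df.infer_objects(dtype_backend="pyarrow")
```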