This is a doozy:

I'm working through the issues with the setitem-with-expansion tag, and about half of them are of the form "I added a row and got unwanted casting." Most of those are int->float, some are EA->object. At least one is object->non-object (we special-case when the original is empty).

Some of these (df.loc[new_row] = values) go through a path that uses concat. Others (df.loc[new_row, :] = values) go through a path that reindexes instead. These two paths can give subtly different behaviors:

import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = df.copy()

df.loc[2] = 5
df2.loc[2, :] = 5

assert (df.dtypes == "int64").all()
assert (df2.dtypes == "float64").all()

I have a branch that updates the concat-codepath to call _cast_pointwise_result on the new values to try to cast to the column's original dtype, but there are some problems there:

1) Doing that without patching the reindex path introduces even more inconsistencies.

2) We have exactly one test case (test_partial_setting) that seems to be expecting non-casting:

import pandas._testing as tm

ser = pd.Series([1, 2, 3])
ser.loc[5] = 5.0
expected = pd.Series([1, 2, 3, 5.0], index=[0, 1, 2, 5], dtype="float64")
tm.assert_series_equal(ser, expected)

3) The elegant solution of using _cast_pointwise_result requires adding checks to NumpyEA._cast_pointwise_result and MaskedEA._cast_pointwise_result for "if the original is integer and the result is all-round floats, cast back to ints". This change affects a bunch of map/apply/combine behavior that may not be desired. Of course we could put those checks outside the _cast_pointwise_result calls; they just become much less elegant.
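
For illustration, a rough sketch of the kind of check described in 3); the helper name is made up here and this is not how _cast_pointwise_result is actually structured:

import numpy as np

def maybe_cast_back_to_int(result, original_dtype):
    # hypothetical check: if the original column was integer-dtyped and the
    # pointwise result is all finite, round floats, cast back to the ints
    arr = np.asarray(result)
    if np.issubdtype(original_dtype, np.integer) and arr.dtype.kind == "f":
        if np.isfinite(arr).all() and (arr == np.round(arr)).all():
            return arr.astype(original_dtype)
    return arr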

In the background there is also the consideration that I'm hoping that in 4.0 we will get to nullable-by-default, in which case the reindex codepath will not do any casting, so the issue will fix itself for those cases. Plausibly we could refactor to always go down that path and get consistency that way.

ATM I'm inclined to say that the test referenced in point 2) is wrong/undesired, and the setitem-with-expansion cases should behave like the non-expansion cases as much as possible (i.e. it will cast round floats to ints).
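
For reference, a minimal example of the non-expansion behaviour this would match (assuming the behaviour on current main):

import pandas as pd

ser = pd.Series([1, 2, 3])
ser.loc[1] = 5.0               # round float set at an existing label
assert ser.dtype == "int64"    # lossless value, so no upcast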

I'm taking a look at what it would take to patch the reindexing codepath with the same _cast_pointwise_result logic as the concat codepath, but I think it's a bit more involved. Assuming that isn't feasible, should we try to fix the concat codepaths alone?

Update: Two more differences in the reindex path are that a) we get the PDEP-6 behavior that disallows upcasting, and b) we don't get the special-casing of the empty cases.

Comment From: jbrockmendel

Not time-sensitive, but I will ask @mroeschke, @jorisvandenbossche, @rhshadrach for opinions on this at the next dev call.

Comment From: rhshadrach

I often can't make the dev call these days, so leaving some comments here. Very positive on the overall goal.

ATM I'm inclined to say that the test referenced in point 2) is wrong/undesired

+1, especially in light of PDEP-6.

... "if the original is integer and the result is all-round floats, cast back to ints". This change affects a bunch of map/apply/combine behavior that may not be desired.

I do fairly strongly believe that if a user gives us a UDF in e.g. map that returns floats on int data, the result should be float.
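
For concreteness, a minimal example of the behaviour being defended here (assuming current main):

import pandas as pd

ser = pd.Series([1, 2, 3])
result = ser.map(lambda x: x * 1.0)   # UDF returns round floats on int data
assert result.dtype == "float64"      # today the float result is kept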

Comment From: jorisvandenbossche

Regarding the exact casting behaviour, it's maybe good to consider the different cases:

  1. Setting with a same-dtype value (e.g. setting an int to an integer column, which currently can get cast to float because of the reindexing behaviour) -> I think this is a no-brainer that the dtype should be preserved (and the current behaviour is a bug / side-effect of reindexing non-nullable ints)
  2. Setting with a different dtype value, but where a cast could be lossless (e.g. setting with a round float to an integer column, i.e. the cases we still allow in normal setitem) -> this is the main discussion point raised above, I think? Personally I can see a case for both ways, essentially depending on whether you think about the operation as a reindex-then-set or as a concat of the object with the value being set (and then the common dtype rules apply). But given we are still in a setitem indexing operation, I am fine with going for the reindex+setitem casting logic.
  3. Setting with an incompatible value -> this currently works fine and upcasts (e.g. int to float, or typically to object dtype); the three cases are sketched below
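
To make the three cases concrete on an int64 column (the resulting dtypes are exactly what is under discussion, so none are asserted here):

import pandas as pd

df = pd.DataFrame({"A": [1, 2]})              # int64 column
df1, df2, df3 = df.copy(), df.copy(), df.copy()

df1.loc[2] = 3      # case 1: same-dtype value
df2.loc[2] = 3.0    # case 2: different dtype, but losslessly castable
df3.loc[2] = 3.5    # case 3: incompatible value; currently upcasts to float64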

I am assuming we want to keep the third case working. But if so, that also means that a pure "reindex+setitem" logic (which could resolve the inconsistency once we have all nullable dtypes, as you mentioned) wouldn't work for that case (because the setitem step no longer allows upcasting). We could of course still upcast when setitem fails, but that also introduces some inconsistencies between cases 2 and 3 (like df.loc[2] = 5.0 vs df.loc[2] = 5.1 would give different dtypes, even though both values are floats).
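
A hypothetical sketch of that fallback rule, using a nullable Int64 column so the reindex step itself preserves the dtype (set_with_expansion is made up for illustration, not an existing code path):

import pandas as pd

def set_with_expansion(ser, label, value):
    out = ser.reindex(ser.index.union([label]))   # Int64 reindex fills with pd.NA
    if value == int(value):          # lossless: keep the integer dtype
        out.loc[label] = int(value)
    else:                            # lossy: upcast before setting
        out = out.astype("Float64")
        out.loc[label] = value
    return out

ser = pd.Series([1, 2, 3], dtype="Int64")
set_with_expansion(ser, 3, 5.0).dtype   # Int64
set_with_expansion(ser, 3, 5.1).dtype   # Float64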

... "if the original is integer and the result is all-round floats, cast back to ints". This change affects a bunch of map/apply/combine behavior that may not be desired.

I do fairly strongly believe that if a user gives us a UDF in e.g. map that returns floats on int data, the result should be float.

+1

Comment From: jbrockmendel

I am assuming we want to keep the third case [setitem with incompatible value] working

I'd be inclined to not allow this and apply PDEP-6 rules even with expansion. Not implacably opposed.

Mostly I'd like to get a single code path with consistent behavior. In a nullable-by-default world I'd expect the reindex path to Just Work for free, so I'm inclined to make that the path that we retain.
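
For example, with a nullable dtype the reindex step already preserves the dtype today, which is what makes this plausible (assuming current main):

import pandas as pd

ser = pd.Series([1, 2], dtype="Int64")
assert ser.reindex([0, 1, 2]).dtype == "Int64"     # new label becomes <NA>, no cast

ser2 = pd.Series([1, 2])                           # non-nullable int64
assert ser2.reindex([0, 1, 2]).dtype == "float64"  # NaN forces the cast to float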

Comment From: rhshadrach

2. Setting with a different dtype value, but where a cast could be lossless (e.g. setting with a round float to an integer column, i.e. the cases we still allow in normal setitem) -> this is the main discussion point raised above, I think? Personally I can see a case for both ways

I think we should always be strict here for consistency. I think as long as there are some places where we need to differentiate 3 vs 3.0, then we should do it across the board. Otherwise users will have a hard time predicting where it matters and where it doesn't.

I am assuming we want to keep the third case working.

I do not want this to work - it'd be much safer to force users to cast to object or whatever parent type they desire. That is maybe harder for a user on small data who doesn't care, but it makes pandas safer to use.