Pandas DEPR: pd.concat special cases DatetimeIndex to sort even when sort=False

pd.concat has a specific behavior to always sort DatetimeIndex when join="outer" and the non-concatentation axes are not aligned. This was undocumented (prior to 2.2.1) and is inconsistent with all other index types and data types including other temporal types such as pyarrow timestamps/dates/times.

An attempt to treat this as a bug fix highlighted that this has been long-standing behavior that users may be accustomed to. (xref #57006)

Here are two options that come to mind:

Deprecate the existing behavior and do not sort by default for DatetimeIndex. This would simplify things by removing the carve out and treating all index types / data types consistently. This would require users to explicitly pass sort=True when concatenating two frames with monotonic indexes if they wanted to ensure a monotonic result.
Always sort the non-concatenation axis when join="outer" (for all dtypes). This would be consistent with how pd.merge and DataFrame.join handle outer merges and in practice may be more useful behavior since concatenating two frames with monotonic indexes will return a frame with a monotonic index.

Comment From: rhshadrach

+1 on (1) - different sorting defaults for different dtypes is very surprising.

For (2), I do prefer concat's way of maintaining order by starting with the first frame and tacking on any new labels to the end as it proceeds through the list. When this order is desired, I think this behavior is harder for the user to replicate.

Comment From: torext

My vote very much goes to option 2. Setting sort=False then should behave the way @rhshadrach describes his preferred concat behaviour. Though I really must say, I don't quite see how often that is useful in real-word examples. Given how much time series-related functionality is built into Pandas I'm surprised option 2 isn't considered the more reasonable use case. In all the applications I see Pandas used it's very rare to have a DataFrame with both "label-style" index and columns; usually one of the two is rather a time dimension. Maybe @rhshadrach can provide concrete real-world examples where the non-sorting behaviour is desireable on a true outer join? I personally can't think of any good ones...

I'm more than happy however to provide a slew of examples of concat usage where one of the axes represents a time dimension and, consequently, maintaining sorting on an outer join along the other axis is paramount. Maybe however I'm using the wrong tool? I.e. should I be using merge in those cases?

Comment From: rhshadrach

In all the applications I see Pandas used it's very rare to have a DataFrame with both "label-style" index and columns; usually one of the two is rather a time dimension.

I always use "label-style" index and columns.

Maybe @rhshadrach can provide concrete real-world examples where the non-sorting behaviour is desireable on a true outer join?

Any DataFrame where the order of the rows/columns contains information, such as a sequence of events.

I'm more than happy however to provide a slew of examples of concat usage where one of the axes represents a time dimension and, consequently, maintaining sorting on an outer join along the other axis is paramount.

And that is still quite easy to attain: df.sort_index(axis="index") and df.sort_index(axis="columns"). However, if we sorted the non-concatenating index by default, how would users get the alternative order back? That isn't so easy.

Comment From: torext

However, if we sorted the non-concatenating index by default, how would users get the alternative order back? That isn't so easy.

By using the sort=False argument in pd.concat, no? Indeed I'm not saying the sort kwarg should be ignored as it used to be, I'm just arguing that the default of sorting seems very reasonable to me in order to keep a vast majority of code short and sweet.

The label-only cases surely are out there, but given how much timeseries functionality has been built into Pandas I tend to be convinced that many usecases these days have a time dimension on one of the axes. For what it's worth, essentially all of mine do.

Are you suggesting that otherwise anyone working almost exclusively with timeseries-like data will always have to invoke pd.concat with subsequent explicit sort calls? That seems rather verbose and error-prone to me. There surely then must be a better way to safely assemble DataFrames in such a setting; maybe concat is not the right thing to use?

Comment From: rhshadrach

By using the sort=False argument in pd.concat, no?

Perhaps I'm misunderstanding. @lukemanley opened this issue with the line:

pd.concat has a specific behavior to always sort DatetimeIndex when join="outer" and the non-concatentation axes are not aligned.

Then presented two options to rectify the inconsistency. The first is to treat DatetimeIndex as all other dtypes are treated today. The second was:

Always sort the non-concatenation axis when join="outer" (for all dtypes).

and this is the one you agreed with in https://github.com/pandas-dev/pandas/issues/57335#issuecomment-1946115289. Perhaps "always" doesn't mean what I think it means?

Are you suggesting that otherwise anyone working almost exclusively with timeseries-like data will always have to invoke pd.concat with subsequent explicit sort calls?

No - sort=True will still sort.

Comment From: rhshadrach

Rereading your previous comment, it seems to me you're advocating changing the default of sort in concat to True across the board for all dtypes. If that's the case, I think it should be discussed in a separate issue - this one is about rectifying a specific inconsistency with DatetimeIndex.

If, on the other hand, you're advocating for a dtype-dependent sorting default (sort=True for DatetimeIndex, False for other dtypes), then I think this is too complex and surprising of a default.

Comment From: torext

Perhaps "always" doesn't mean what I think it means?

Yeah was maybe not being clear here: by "always" I really meant when sort=True and to have that as the default, at least in the DateTimeIndex case.

If, on the other hand, you're advocating for a dtype-dependent sorting default (sort=True for DatetimeIndex, False for other dtypes), then I think this is too complex and surprising of a default.

Yeah I can see your point re. surprising behaviour and complexity here. I'd be quite content with sort=True by default across the board indeed. As I mentioned, .groupby() already has sort=True by default.

No - sort=True will still sort.

Sure, but unless one forgets to call that it will end up unsorted leading likely to failures that give erroneous results but without throwing an error, i.e. the worst kind of bug (again, coming form the timeseries perspective here). Putting what I'm saying differently: don't you think that sorting by default is safer than not sorting by default? Sure, the former might have unnecessary overheard in some cases, but those can be solved by then explicitly setting sort=False; the same goes for the cases where one wants to "preserve" order (i.e. the appending behaviour).

Come to think of it, I would actually also argue that the behaviour of appending indices on the non-concatenation axis is not 100% transparent either: if concat is called on e.g. a dictionary, then one tends to think of that as an unsorted collection of elements to be concatenated, yet their implicit order will inlfuence the outcome of the non-concatenation axis when sort=False. You see what I mean?

Comment From: rhshadrach

I'd be quite content with sort=True by default across the board indeed. As I mentioned, .groupby() already has sort=True by default.

As mentioned above, I recommend opening a new issue if you'd like to pursue changing the default value of sort.

Comment From: jbrockmendel

@rhshadrach is this actionable for 3.0?

Comment From: rhshadrach

I'm +1 on calling this a bugfix and having all dtypes respect the sort argument. -1 on changing sort defaults without a proposal that considers the entire API.

Comment From: jbrockmendel

It looks like #55085 removed the special-casing here, but there were enough complaints (#57006) that it was reverted in #57250. Maybe a breaking change is more palatable in a major release?

The overriding of the keyword is done in union_indexes, which is reached from the concat code but also from internals.construction, so deprecating that behavior may be tough; hard to give a user-facing message when reachable via multiple APIs.