pd.concat
has a specific behavior to always sort DatetimeIndex
when join="outer"
and the non-concatentation axes are not aligned. This was undocumented (prior to 2.2.1) and is inconsistent with all other index types and data types including other temporal types such as pyarrow timestamps/dates/times.
An attempt to treat this as a bug fix highlighted that this has been long-standing behavior that users may be accustomed to. (xref #57006)
Here are two options that come to mind:
-
Deprecate the existing behavior and do not sort by default for
DatetimeIndex
. This would simplify things by removing the carve out and treating all index types / data types consistently. This would require users to explicitly passsort=True
when concatenating two frames with monotonic indexes if they wanted to ensure a monotonic result. -
Always sort the non-concatenation axis when
join="outer"
(for all dtypes). This would be consistent with howpd.merge
andDataFrame.join
handle outer merges and in practice may be more useful behavior since concatenating two frames with monotonic indexes will return a frame with a monotonic index.
Comment From: rhshadrach
+1 on (1) - different sorting defaults for different dtypes is very surprising.
For (2), I do prefer concat's way of maintaining order by starting with the first frame and tacking on any new labels to the end as it proceeds through the list. When this order is desired, I think this behavior is harder for the user to replicate.
Comment From: torext
My vote very much goes to option 2. Setting sort=False
then should behave the way @rhshadrach describes his preferred concat behaviour. Though I really must say, I don't quite see how often that is useful in real-word examples. Given how much time series-related functionality is built into Pandas I'm surprised option 2 isn't considered the more reasonable use case. In all the applications I see Pandas used it's very rare to have a DataFrame with both "label-style" index and columns; usually one of the two is rather a time dimension. Maybe @rhshadrach can provide concrete real-world examples where the non-sorting behaviour is desireable on a true outer join? I personally can't think of any good ones...
I'm more than happy however to provide a slew of examples of concat
usage where one of the axes represents a time dimension and, consequently, maintaining sorting on an outer join along the other axis is paramount. Maybe however I'm using the wrong tool? I.e. should I be using merge
in those cases?
Comment From: rhshadrach
In all the applications I see Pandas used it's very rare to have a DataFrame with both "label-style" index and columns; usually one of the two is rather a time dimension.
I always use "label-style" index and columns.
Maybe @rhshadrach can provide concrete real-world examples where the non-sorting behaviour is desireable on a true outer join?
Any DataFrame where the order of the rows/columns contains information, such as a sequence of events.
I'm more than happy however to provide a slew of examples of
concat
usage where one of the axes represents a time dimension and, consequently, maintaining sorting on an outer join along the other axis is paramount.
And that is still quite easy to attain: df.sort_index(axis="index")
and df.sort_index(axis="columns")
. However, if we sorted the non-concatenating index by default, how would users get the alternative order back? That isn't so easy.
Comment From: torext
However, if we sorted the non-concatenating index by default, how would users get the alternative order back? That isn't so easy.
By using the sort=False
argument in pd.concat
, no? Indeed I'm not saying the sort kwarg should be ignored as it used to be, I'm just arguing that the default of sorting seems very reasonable to me in order to keep a vast majority of code short and sweet.
The label-only cases surely are out there, but given how much timeseries functionality has been built into Pandas I tend to be convinced that many usecases these days have a time dimension on one of the axes. For what it's worth, essentially all of mine do.
Are you suggesting that otherwise anyone working almost exclusively with timeseries-like data will always have to invoke pd.concat with subsequent explicit sort calls? That seems rather verbose and error-prone to me. There surely then must be a better way to safely assemble DataFrames in such a setting; maybe concat is not the right thing to use?
Comment From: rhshadrach
By using the
sort=False
argument inpd.concat
, no?
Perhaps I'm misunderstanding. @lukemanley opened this issue with the line:
pd.concat
has a specific behavior to always sortDatetimeIndex
whenjoin="outer"
and the non-concatentation axes are not aligned.
Then presented two options to rectify the inconsistency. The first is to treat DatetimeIndex
as all other dtypes are treated today. The second was:
Always sort the non-concatenation axis when join="outer" (for all dtypes).
and this is the one you agreed with in https://github.com/pandas-dev/pandas/issues/57335#issuecomment-1946115289. Perhaps "always" doesn't mean what I think it means?
Are you suggesting that otherwise anyone working almost exclusively with timeseries-like data will always have to invoke pd.concat with subsequent explicit sort calls?
No - sort=True
will still sort.
Comment From: rhshadrach
Rereading your previous comment, it seems to me you're advocating changing the default of sort
in concat to True
across the board for all dtypes. If that's the case, I think it should be discussed in a separate issue - this one is about rectifying a specific inconsistency with DatetimeIndex.
If, on the other hand, you're advocating for a dtype-dependent sorting default (sort=True
for DatetimeIndex, False
for other dtypes), then I think this is too complex and surprising of a default.
Comment From: torext
Perhaps "always" doesn't mean what I think it means?
Yeah was maybe not being clear here: by "always" I really meant when sort=True
and to have that as the default, at least in the DateTimeIndex
case.
If, on the other hand, you're advocating for a dtype-dependent sorting default (sort=True for DatetimeIndex, False for other dtypes), then I think this is too complex and surprising of a default.
Yeah I can see your point re. surprising behaviour and complexity here. I'd be quite content with sort=True
by default across the board indeed. As I mentioned, .groupby()
already has sort=True
by default.
No - sort=True will still sort.
Sure, but unless one forgets to call that it will end up unsorted leading likely to failures that give erroneous results but without throwing an error, i.e. the worst kind of bug (again, coming form the timeseries perspective here). Putting what I'm saying differently: don't you think that sorting by default is safer than not sorting by default? Sure, the former might have unnecessary overheard in some cases, but those can be solved by then explicitly setting sort=False
; the same goes for the cases where one wants to "preserve" order (i.e. the appending behaviour).
Come to think of it, I would actually also argue that the behaviour of appending indices on the non-concatenation axis is not 100% transparent either: if concat
is called on e.g. a dictionary, then one tends to think of that as an unsorted collection of elements to be concatenated, yet their implicit order will inlfuence the outcome of the non-concatenation axis when sort=False
. You see what I mean?
Comment From: rhshadrach
I'd be quite content with
sort=True
by default across the board indeed. As I mentioned,.groupby()
already hassort=True
by default.
As mentioned above, I recommend opening a new issue if you'd like to pursue changing the default value of sort.
Comment From: jbrockmendel
@rhshadrach is this actionable for 3.0?
Comment From: rhshadrach
I'm +1 on calling this a bugfix and having all dtypes respect the sort argument. -1 on changing sort defaults without a proposal that considers the entire API.