Pandas BUG: should constructing Index from a Series make a copy?

From a comment of @jbrockmendel at https://github.com/pandas-dev/pandas/pull/41878#issuecomment-881557871:

ser = pd.Series(range(5))
idx = pd.Index(ser)
ser[0] = 10
>>> idx[0]
10

In the above, we create an Index from a Series and then mutate the Series. The mutation also changes the Index, even though an Index is assumed to be immutable.

Changing the example a bit, you can get wrong values when indexing:

ser = pd.Series(range(5))
idx = pd.Index(ser)

ser.index = idx

>>> ser[0]
0
>>> ser.iloc[0] = 10
>>> ser[0]
10
>>> ser
10    10
1      1
2      2
3      3
4      4
dtype: int64

So ser[0] still returns a result, even though the key 0 no longer exists in the Series' index at that point.

I know that we generally consider this a user error when it is done with a numpy array (idx = pd.Index(arr), then mutating the array), but here you get it using only high-level pandas objects. In that case, shouldn't we prevent this from happening?
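
As a user-side workaround until this is settled, one option is to request a copy explicitly when building the Index. The following is a minimal sketch, assuming that the `copy=True` argument of the `pd.Index` constructor gives the Index its own data; it does not describe how pandas should fix the issue internally.

```python
import pandas as pd

ser = pd.Series(range(5))

# copy=True asks the Index constructor to copy the input data,
# so the Index should not alias the Series' values.
idx = pd.Index(ser, copy=True)

# Mutate the Series positionally after the Index has been created.
ser.iloc[0] = 10

# With an explicit copy, the Index keeps its original first label.
first_label = idx[0]
print(first_label, ser.iloc[0])
```

The cost is one extra copy of the data, which is the trade-off a fix inside pandas would also have to weigh.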

Comment From: jorisvandenbossche

A similar example, but with a DataFrame and set_index (so even without explicitly calling Index(ser)), has the same problem. It is a good illustration that, IMO, this is not a "user error" but something we should fix on pandas' side:

In [33]: df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [0.1, 0.2, 0.3]})

In [36]: df.set_index("a", drop=False, inplace=True)

In [37]: df
Out[37]: 
   a  b    c
a           
1  1  4  0.1
2  2  5  0.2
3  3  6  0.3

In [38]: df.loc[1, 'a']
Out[38]: 1

In [39]: df.iloc[0, 0] = 10

In [40]: df
Out[40]: 
     a  b    c
a             
10  10  4  0.1
2    2  5  0.2
3    3  6  0.3

In [41]: df.loc[1, 'a']
Out[41]: 10

Comment From: attack68

This probably doesn't fit in this thread, but for consideration alongside this issue: while an Index is not mutable, index.names is. Recently this caught me out in the following case:

idx = pd.Index(["a", "b"])
df = pd.DataFrame([[1,2],[3,4]], columns=idx, index=idx)
df.index.names = ["zzz"]

zzz  a  b
zzz
a    1  2
b    3  4

I wasn't expecting the columns' names to be changed; I intuitively expected the constructor to create a new Index object.

I agree your examples should be corrected.

Comment From: jorisvandenbossche

Hmm, since you are explicitly passing the same Index object as both the columns and the row index, this can maybe be considered expected behaviour. Not really sure ... (but indeed a different issue).
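
If the shared names are unwanted, a possible user-side way around it is to pass a separate Index object for one of the axes. This is only a sketch, assuming `Index.copy()` returns a new Index whose names can be set independently of the original:

```python
import pandas as pd

idx = pd.Index(["a", "b"])

# Pass a copy for the columns so the two axes are distinct Index objects.
df = pd.DataFrame([[1, 2], [3, 4]], columns=idx.copy(), index=idx)

# Rename only the row index.
df.index.names = ["zzz"]

index_names = list(df.index.names)
column_names = list(df.columns.names)
print(index_names, column_names)
```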

Comment From: rhshadrach

Since properties of the index are assumed immutable, they are cached, and this can lead to invalid states:

ser = pd.Series(range(2))
idx = pd.Index(ser)
idx.is_monotonic_increasing
ser[0] = 10
print(idx)
print('Is monotonic increasing:', idx.is_monotonic_increasing)

gives

Int64Index([10, 1], dtype='int64')
Is monotonic increasing: True
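
One way to detect such a stale cache is to compare the cached property with the value recomputed on a fresh Index built from a copy of the current values. The sketch below is illustrative only; `cached_monotonic_is_stale` is a hypothetical helper, not a pandas function.

```python
import pandas as pd


def cached_monotonic_is_stale(idx: pd.Index) -> bool:
    """Return True if idx's (possibly cached) monotonicity disagrees with its current values."""
    cached = idx.is_monotonic_increasing
    # A fresh Index built from a copy of the values carries no cached state.
    fresh = pd.Index(idx.to_numpy(copy=True))
    return bool(cached != fresh.is_monotonic_increasing)


# An Index that was never mutated behind its back should not be stale.
idx = pd.Index([0, 1, 2])
stale = cached_monotonic_is_stale(idx)
print("Cache is stale:", stale)
```

Running the helper on the `idx` from the example above, after the `ser[0] = 10` line, would be expected to report a stale cache on versions where the Index shares memory with the Series.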

Comment From: ehansis

It seems that this issue caused random SIGSEGV and SIGBUS crashes in my code (macOS, pandas=1.1.4=py38hcf432d8_0 from conda-forge). Unfortunately, I cannot reproduce them in a minimal example. My code looks something like this:

df = pd.DataFrame([
    [... data here ...]
], columns=["code", "value", "foo"])
df = df.set_index("code", drop=False)
df.loc["some_code", "code"] = "abc"
df.loc["some_other_code", "code"] = "def"

This modifies the index, as described above. However, if I repeat the line

df.loc["some_other_code", "code"] = "def"

once more, the assignment sometimes works (apparently re-using the previous index values) and sometimes crashes the interpreter.

Comment From: rhshadrach

@ehansis - thanks for adding to the discussion here; however, this appears to be a separate issue. If you add in print(id(df.index)), you should see different ids, e.g.

139902856885584
139902875398688

before and after the df.loc["some_code", "code"] = "abc" line. This is because pandas is not modifying an index, but rather creating and replacing one.
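
A sketch of that check, using made-up data in place of the elided rows. It keeps a reference to the old index so that Python cannot reuse its id, and it only reports the result, since the outcome may depend on the pandas version:

```python
import pandas as pd

# Made-up rows standing in for the real data.
df = pd.DataFrame(
    [["some_code", 1, "x"], ["some_other_code", 2, "y"]],
    columns=["code", "value", "foo"],
)
df = df.set_index("code", drop=False)

# Hold a reference to the index object before the assignment.
index_before = df.index
df.loc["some_code", "code"] = "abc"
index_after = df.index

# Identity comparison is more reliable than comparing printed ids.
same_object = index_after is index_before
print(id(index_before), id(index_after), "same object:", same_object)
```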

Comment From: ehansis

@rhshadrach OK, thanks, I'll try to check that if I ever manage to reproduce the segmentation faults. Let me know if I can be of further help.

Comment From: jorisvandenbossche

BTW, the other way around, when constructing a Series from an Index, we do take a copy to avoid such issues. In Series.__init__:

https://github.com/pandas-dev/pandas/blob/57d8d3a7cc2c4afc8746bf774b5062fa70c0f5fd/pandas/core/series.py#L401-L409

Comment From: rhshadrach

@jorisvandenbossche - it appears we no longer make a copy when constructing a Series from an Index due to https://github.com/pandas-dev/pandas/pull/52008.
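
To see what a given installation does, one can check whether the Series and the Index share memory. This is a rough diagnostic sketch, assuming `np.shares_memory` on the `to_numpy()` results reflects the underlying buffers; the result is only printed because it may differ between pandas versions and Copy-on-Write settings.

```python
import numpy as np
import pandas as pd

idx = pd.Index([1, 2, 3])
ser = pd.Series(idx)

# True means the Series was built on the Index's buffer without a copy.
shares_buffer = bool(np.shares_memory(ser.to_numpy(), idx.to_numpy()))
print(pd.__version__, "Series shares memory with Index:", shares_buffer)
```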