Code Sample, a copy-pastable example if possible
import pandas as pd
a = pd.DataFrame({"col1": [1, 2, 3]}, index=["a", "b", "c"])
b = a.copy()
b.index.values[0] = "newval"
assert a.index[0] == "newval"
Problem description
Index
objects are assumed to be immutable, but wrap a mutable object. In my experience, it's pretty common to just grab the .values
from pandas object. This has caused surprising behaviour in my own code, and has been discussed here before https://github.com/pandas-dev/pandas/issues/19862.
I would like to make sure that the values numpy array for index objects has it'sWRITABLE
flag set to false, to make Index
object more immutable.
Does this sound reasonable? If so, I can give a PR a shot.
Expected Output
# This should have thrown an error
b.index.values[0] = "newval"
Output of pd.show_versions()
Comment From: ivirshup
As is want to happen, this looks more complicated than I expected. After changing the values
property to return a read only view, 222 tests fail. I've investigated a bit. These fail with one of two error messages:
ValueError: buffer source array is read-only
(101 failures)
This looks like a cython issue, but I'm not that familiar with cython. I couldn't quite follow the history, but I think either cython memory view arrays don't work on read only memory, or they require a different signature to work. Not quite sure how to work around this one, but suspect there's a templating solution.
ValueError: assignment destination is read-only
(120 failures)
Intervals
It looks like interval arrays are constructed out of two indexes. But interval arrays can be mutated. This is currently done by copying the Index objects, then modifying the copies inplace. This seems very fixable.
These tests include: Most of the TestSetitem
classes tests, TestSeriesConvertDtypes.test_convert_dtypes
Update: This was very easy to fix.
A number of tests rely on inplace modification of index objects
For example: test_unique_null
, test_fillna_null
, test_unique_null
, test_value_counts_null
, TestFloat64Index.test_fillna
, TestFloat64Index.test_hasnans_isnans
I'm thinking these should just not do that.
Comment From: jbrockmendel
Do the tracebacks show what cython functions are raising? usually that means we need to change a type annotation in that function
Comment From: ivirshup
Yes, it's a bunch of functions. I could try and collect those if that would be useful.
Here's one example:
pandas/tests/indexes/datetimes/test_join.py:42:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas/core/indexes/base.py:3441: in join
self, how=how, level=level, return_indexers=return_indexers
pandas/core/indexes/datetimelike.py:856: in join
sort=sort,
pandas/core/indexes/base.py:3451: in join
return this.join(other, how=how, return_indexers=return_indexers)
pandas/core/indexes/base.py:3471: in join
other, how=how, return_indexers=return_indexers
pandas/core/indexes/base.py:3778: in _join_monotonic
lidx = self._left_indexer_unique(ov, sv)
pandas/core/indexes/base.py:239: in _left_indexer_unique
return libjoin.left_join_indexer_unique(left, right)
pandas/_libs/join.pyx:270: in pandas._libs.join.left_join_indexer_unique
def left_join_indexer_unique(join_t[:] left, join_t[:] right):
stringsource:658: in View.MemoryView.memoryview_cwrapper
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E ValueError: buffer source array is read-only
stringsource:349: ValueError
So I thought this could be fixable by changing the signature from
left_join_indexer_unique(join_t[:] left, join_t[:] right)
to
left_join_indexer_unique(const join_t[:] left, const join_t[:] right)
But it turns out doing that raises this fun exception during compilation:
Exception: This may never happen, please report a bug
Full compilation traceback
Compiling pandas/_libs/join.pyx because it changed.
[1/1] Cythonizing pandas/_libs/join.pyx
Error compiling Cython file:
------------------------------------------------------------
...
# right might contain non-unique values
@cython.wraparound(False)
@cython.boundscheck(False)
def left_join_indexer_unique(const join_t[:] left, const join_t[:] right):
^
------------------------------------------------------------
pandas/_libs/join.pyx:280:42: Compiler crash in AnalyseDeclarationsTransform
File 'ModuleNode.py', line 124, in analyse_declarations: ModuleNode(join.pyx:1:0,
full_module_name = 'pandas._libs.join')
File 'Nodes.py', line 431, in analyse_declarations: StatListNode(join.pyx:1:0)
File 'Nodes.py', line 431, in analyse_declarations: StatListNode(join.pyx:280:0)
File 'Nodes.py', line 375, in analyse_declarations: CompilerDirectivesNode(join.pyx:280:0)
File 'Nodes.py', line 431, in analyse_declarations: StatListNode(join.pyx:280:0)
File 'Nodes.py', line 2887, in analyse_declarations: DefNode(join.pyx:280:0,
modifiers = [...]/0,
name = 'left_join_indexer_unique',
num_required_args = 2,
outer_attrs = [...]/2,
py_wrapper_required = True,
reqd_kw_flags_cname = '0')
File 'Nodes.py', line 2925, in analyse_argument_types: DefNode(join.pyx:280:0,
modifiers = [...]/0,
name = 'left_join_indexer_unique',
num_required_args = 2,
outer_attrs = [...]/2,
py_wrapper_required = True,
reqd_kw_flags_cname = '0')
File 'Nodes.py', line 1093, in analyse: MemoryViewSliceTypeNode(join.pyx:280:42,
name = 'memoryview')
Compiler crash traceback from this point on:
File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Compiler/Nodes.py", line 1093, in analyse
self.type = PyrexTypes.MemoryViewSliceType(base_type, axes_specs)
File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Compiler/PyrexTypes.py", line 625, in __init__
self.dtype_name = Buffer.mangle_dtype_name(self.dtype)
File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Compiler/Buffer.py", line 631, in mangle_dtype_name
return prefix + dtype.specialization_name()
File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Compiler/PyrexTypes.py", line 57, in specialization_name
common_subs = (self.empty_declaration_code()
File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Compiler/PyrexTypes.py", line 51, in empty_declaration_code
self._empty_declaration = self.declaration_code('')
File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Compiler/PyrexTypes.py", line 1582, in declaration_code
return self.const_base_type.declaration_code("const %s" % entity_code, for_display, dll_linkage, pyrex)
File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Compiler/PyrexTypes.py", line 1649, in declaration_code
raise Exception("This may never happen, please report a bug")
Exception: This may never happen, please report a bug
Traceback (most recent call last):
File "setup.py", line 791, in <module>
setup_package()
File "setup.py", line 761, in setup_package
ext_modules=maybe_cythonize(extensions, compiler_directives=directives),
File "setup.py", line 534, in maybe_cythonize
return cythonize(extensions, *args, **kwargs)
File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Build/Dependencies.py", line 1102, in cythonize
cythonize_one(*args)
File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Build/Dependencies.py", line 1225, in cythonize_one
raise CompileError(None, pyx_file)
Cython.Compiler.Errors.CompileError: pandas/_libs/join.pyx
(pandas-dev) isaac@Mimir:~/github/pandas ‹master›
$ python3 setup.py build_ext --inplace -j 12
Compiling pandas/_libs/join.pyx because it changed.
[1/1] Cythonizing pandas/_libs/join.pyx
Error compiling Cython file:
------------------------------------------------------------
...
# right might contain non-unique values
@cython.wraparound(False)
@cython.boundscheck(False)
def left_join_indexer_unique(const join_t[:] left, const join_t[:] right):
^
------------------------------------------------------------
pandas/_libs/join.pyx:270:42: Compiler crash in AnalyseDeclarationsTransform
File 'ModuleNode.py', line 124, in analyse_declarations: ModuleNode(join.pyx:1:0,
full_module_name = 'pandas._libs.join')
File 'Nodes.py', line 431, in analyse_declarations: StatListNode(join.pyx:1:0)
File 'Nodes.py', line 431, in analyse_declarations: StatListNode(join.pyx:270:0)
File 'Nodes.py', line 375, in analyse_declarations: CompilerDirectivesNode(join.pyx:270:0)
File 'Nodes.py', line 431, in analyse_declarations: StatListNode(join.pyx:270:0)
File 'Nodes.py', line 2887, in analyse_declarations: DefNode(join.pyx:270:0,
modifiers = [...]/0,
name = 'left_join_indexer_unique',
num_required_args = 2,
outer_attrs = [...]/2,
py_wrapper_required = True,
reqd_kw_flags_cname = '0')
File 'Nodes.py', line 2925, in analyse_argument_types: DefNode(join.pyx:270:0,
modifiers = [...]/0,
name = 'left_join_indexer_unique',
num_required_args = 2,
outer_attrs = [...]/2,
py_wrapper_required = True,
reqd_kw_flags_cname = '0')
File 'Nodes.py', line 1093, in analyse: MemoryViewSliceTypeNode(join.pyx:270:42,
name = 'memoryview')
Compiler crash traceback from this point on:
File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Compiler/Nodes.py", line 1093, in analyse
self.type = PyrexTypes.MemoryViewSliceType(base_type, axes_specs)
File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Compiler/PyrexTypes.py", line 625, in __init__
self.dtype_name = Buffer.mangle_dtype_name(self.dtype)
File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Compiler/Buffer.py", line 631, in mangle_dtype_name
return prefix + dtype.specialization_name()
File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Compiler/PyrexTypes.py", line 57, in specialization_name
common_subs = (self.empty_declaration_code()
File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Compiler/PyrexTypes.py", line 51, in empty_declaration_code
self._empty_declaration = self.declaration_code('')
File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Compiler/PyrexTypes.py", line 1582, in declaration_code
return self.const_base_type.declaration_code("const %s" % entity_code, for_display, dll_linkage, pyrex)
File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Compiler/PyrexTypes.py", line 1649, in declaration_code
raise Exception("This may never happen, please report a bug")
Exception: This may never happen, please report a bug
Traceback (most recent call last):
File "setup.py", line 791, in <module>
setup_package()
File "setup.py", line 761, in setup_package
ext_modules=maybe_cythonize(extensions, compiler_directives=directives),
File "setup.py", line 534, in maybe_cythonize
return cythonize(extensions, *args, **kwargs)
File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Build/Dependencies.py", line 1102, in cythonize
cythonize_one(*args)
File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Build/Dependencies.py", line 1225, in cythonize_one
raise CompileError(None, pyx_file)
Cython.Compiler.Errors.CompileError: pandas/_libs/join.pyx
Comment From: shoyer
There was a long-standing issue in Cython where it didn't used to support read-only memory views, but that was fixed a few years ago now: https://github.com/cython/cython/issues/1605 https://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html#read-only-views
I don't know what minimum Cython version is required these days to build pandas from source, but it seems like it should probably be OK to use read-only views in pandas now?
@ivirshup Were you using at least Cython version 0.28 when you encountered the error?
Comment From: ivirshup
@shoyer yes, this was using cython v0.29.16 (I'm using the pandas/environment.yaml
conda env).
Those are the docs that made me think I could just chuck a const
in front of the signature.
It looks like @jbrockmendel may have run into this issue before https://github.com/cython/cython/issues/3222.
Comment From: jbrockmendel
For fused types like join_t
cython doesnt support const memoryviews until 0.30, so we're eagerly awaiting that
Comment From: ivirshup
I see that this is fixed in: https://github.com/cython/cython/pull/3118. It looks like this was not considered a bug fix.
Would pandas put a hard restriction on cython >=0.30.0
(it looks like this might be >=3
) once it releases, or do you aim to support older cython versions as well?
Comment From: jbrockmendel
or do you aim to support older cython versions as well?
We distribute wheels, so don't have cython as an install requirement. we are pretty aggressive in bumping it as a dev requirement.
I'm also looking forward to conditional nogil in 0.30
Comment From: TomAugspurger
Regardless of the Cython issues, is this something we actually want to do? What are the implications of this? When we create a Series from an Index (say via a .reset_index()
) would the Series data be writable?
Keep in mind that we'd like to support indexes backed by non-ndarrays some day, which won't necessarily have the same API / capability for marking data as non-writable.
Comment From: ivirshup
If you make a series from an index, you should be making a copy of the underlying data, so yes it would be writable.
I think the current sharing of references, with the false assumption of immutability, can lead to some very bad behavior. I think if pd.Index
objects are going to be treated as immutable, their public methods and attributes should enforce that.
One solution for array classes which do not support immutability would be to return an instance of an immutable view wrapper class. Another would be to return copies of the data from .values
and .array
.
Update: I've opened a PR for this, mainly so it's easy to share what the barriers are.
Comment From: shoyer
I think the current sharing of references, with the false assumption of immutability, can lead to some very bad behavior. I think if
pd.Index
objects are going to be treated as immutable, their public methods and attributes should enforce that.
Strong 👍 from me. "False immutability" has resulted in a number of subtle downstream bugs in xarray, for example.
I think it would be better to make an extra array copy when converting from an Index
to a Series
than to allow modifying the series to break the values. Ideally we would have copy-on-write, but that's a bigger challenge.
Comment From: ivirshup
@TomAugspurger, did you have more thoughts on the feasibility/ desirability of immutability enforcement? I wasn't sure if I should be putting more effort into this yet.
Comment From: TomAugspurger
I think it's a good goal, but I vaguely recall issues with passing readonly arrays to Cython. Perhaps most of those have been fixed.
To test, I'd recommend temporarily changing the Index constructors to all mark their arrays as immutable and run the full test suite to see what breaks.
On Thu, Apr 16, 2020 at 3:48 AM Isaac Virshup notifications@github.com wrote:
@TomAugspurger https://github.com/TomAugspurger, did you have more thoughts on the feasibility/ desirability of immutability enforcement? I wasn't sure if I should be putting more effort into this yet.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/33001#issuecomment-614507715, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAKAOIVPIYOKB2OLD5XVSXTRM3A5BANCNFSM4LTH5AVQ .