Problem description
When a Series is constructed from a float32 masked numpy array, calling mean()
on a resample produces NaNs. This doesn't occur with float64 masked arrays or non-masked float32 arrays. Some operations, like first(), work, while median() raises a ValueError.
Code Sample, a copy-pastable example if possible
import numpy as np
import pandas as pd
arr32 = np.ma.array([1.0, 2.0, 3.0], mask=[False, False, False], dtype='float32')
arr64 = np.ma.array([1.0, 2.0, 3.0], mask=[False, False, False], dtype='float64')
index = pd.date_range(start='2018-03-01 12:00:00Z', end='2018-03-01 12:10:00Z',
freq='5min')
ser32 = pd.Series(arr32, index=index)
ser64 = pd.Series(arr64, index=index)
print('float32 masked array')
print(ser32.resample('5min').mean())
print(ser32.resample('10min').mean())
print('float64 masked array')
print(ser64.resample('5min').mean())
print(ser64.resample('10min').mean())
print('non-masked float32')
print(pd.Series(arr32.data, index=index).resample('5min').mean())
ser32.resample('5min').median()
which outputs
float32 masked array
2018-03-01 12:00:00+00:00 NaN
2018-03-01 12:05:00+00:00 NaN
2018-03-01 12:10:00+00:00 NaN
Freq: 5T, dtype: float32
2018-03-01 12:00:00+00:00 NaN
2018-03-01 12:10:00+00:00 NaN
Freq: 10T, dtype: float32
float64 masked array
2018-03-01 12:00:00+00:00 1.0
2018-03-01 12:05:00+00:00 2.0
2018-03-01 12:10:00+00:00 3.0
Freq: 5T, dtype: float64
2018-03-01 12:00:00+00:00 1.5
2018-03-01 12:10:00+00:00 3.0
Freq: 10T, dtype: float64
non-masked float32
2018-03-01 12:00:00+00:00 1.0
2018-03-01 12:05:00+00:00 2.0
2018-03-01 12:10:00+00:00 3.0
Freq: 5T, dtype: float32
Traceback (most recent call last):
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 1145, in median
return self._cython_agg_general('median', **kwargs)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 921, in _cython_agg_general
min_count=min_count)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 2314, in aggregate
min_count=min_count)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 2242, in _cython_operation
values = _ensure_float64(values)
File "pandas/_libs/algos_common_helper.pxi", line 3182, in pandas._libs.algos.ensure_float64
File "pandas/_libs/algos_common_helper.pxi", line 3187, in pandas._libs.algos.ensure_float64
TypeError: astype() got an unexpected keyword argument 'copy'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/nanops.py", line 128, in f
result = alt(values, axis=axis, skipna=skipna, **kwds)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/nanops.py", line 386, in nanmedian
values = values.ravel()
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/numpy/ma/core.py", line 4532, in ravel
r._mask = ndarray.ravel(self._mask, order=order).reshape(r.shape)
ValueError: cannot reshape array of size 0 into shape (1,)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "bad_pandas.py", line 26, in <module>
ser32.resample('5min').median()
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/resample.py", line 621, in f
return self._downsample(_method)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/resample.py", line 773, in _downsample
self.grouper, axis=self.axis).aggregate(how, **kwargs)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 3121, in aggregate
return getattr(self, func_or_funcs)(*args, **kwargs)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 1156, in median
return self._python_agg_general(f)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 939, in _python_agg_general
result, counts = self.grouper.agg_series(obj, f)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 2591, in agg_series
return grouper.get_result()
File "pandas/_libs/src/reduce.pyx", line 279, in pandas._libs.lib.SeriesBinGrouper.get_result
File "pandas/_libs/src/reduce.pyx", line 265, in pandas._libs.lib.SeriesBinGrouper.get_result
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 933, in <lambda>
f = lambda x: func(x, *args, **kwargs)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 1155, in f
return x.median(axis=self.axis, **kwargs)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/generic.py", line 7315, in stat_func
numeric_only=numeric_only)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/series.py", line 2577, in _reduce
return op(delegate, skipna=skipna, **kwds)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/nanops.py", line 77, in _f
return f(*args, **kwargs)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/nanops.py", line 131, in f
result = alt(values, axis=axis, skipna=skipna, **kwds)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/nanops.py", line 386, in nanmedian
values = values.ravel()
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/numpy/ma/core.py", line 4532, in ravel
r._mask = ndarray.ravel(self._mask, order=order).reshape(r.shape)
ValueError: cannot reshape array of size 0 into shape (1,)
Output of pd.show_versions()
Comment From: jreback
So masked arrays are converted automatically for DataFrames, but I guess not for Series. We should just do this. A foreign ndarray like this doesn't have enough support to be a first-class object in pandas (not to mention it's too complex and, to be honest, not worth it; does anyone use masked arrays?).
So would take a PR to convert masked arrays for Series.
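Until such a conversion exists in the Series constructor, a user-side workaround along these lines may help. This is only a minimal sketch, not the eventual pandas fix: `masked_to_series` is a hypothetical helper name, and it assumes a floating-point masked array so that masked entries can be represented as NaN.

```python
import numpy as np
import pandas as pd


def masked_to_series(marr, index=None):
    """Hypothetical helper: build a Series from a float masked array."""
    # Replace masked entries with NaN; np.ma.filled returns a plain ndarray,
    # so pandas never sees the MaskedArray subclass.
    plain = np.ma.filled(marr, np.nan)
    return pd.Series(np.asarray(plain), index=index)


# Same shape of data as in the report, with one entry actually masked.
arr32 = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False], dtype="float32")
index = pd.date_range("2018-03-01 12:00:00", periods=3, freq="5min", tz="UTC")

ser32 = masked_to_series(arr32, index=index)
means = ser32.resample("5min").mean()
print(means)
```

The masked entry ends up as NaN in the resampled result, while the unmasked float32 values pass through unchanged.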
Comment From: dsm054
This seems to work for me even in pandas 0.22.0 with numpy >= 1.15.1. Maybe something changed which (unintentionally) handled this case?
Comment From: arw2019
This works on 1.2 master (due to #24581 and follow-ons).
There are tests in pandas/tests/frame/test_constructors. The tests don't use a datetime index, but AFAICT this isn't the core issue here.
Comment From: jbrockmendel
Another difference between the Series/DataFrame behavior with numpy masked arrays is what we do with the fill value:
import numpy as np
import pandas as pd
from numpy.ma import mrecords
mask = [(True, False), (False, True), (False, False), (False, True), (False, False)]
data = np.ma.array(np.ma.zeros(5, dtype=[("date", "<f8"), ("price", "<f8")]), mask=mask, fill_value=9999)
recs = data.view(mrecords.mrecarray)
df = pd.DataFrame(recs)
sers = {name: pd.Series(recs[name]) for name in recs.dtype.names}
expected = pd.DataFrame(sers)
>>> df
date price
0 9999.0 0.0
1 0.0 9999.0
2 0.0 0.0
3 0.0 9999.0
4 0.0 0.0
>>> expected
date price
0 NaN 0.0
1 0.0 NaN
2 0.0 0.0
3 0.0 NaN
4 0.0 0.0
i.e. with the mrecords we fill with the array's fill_value, whereas for Series we ignore it. This happens because for Series we go through sanitize_masked_array while for MaskedRecords we go through fill_masked_arrays.
Easy to make these match, just need to decide which is "right"
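For reference, the two candidate behaviours can be reproduced with numpy alone. This is a sketch of the two options, not of the pandas internals: mapping `filled()` to the MaskedRecords path and `filled(np.nan)` to the Series path is an assumption based on the outputs shown above.

```python
import numpy as np

# One column like "price" above: the second entry is masked.
price = np.ma.array([0.0, 0.0, 0.0], mask=[False, True, False], fill_value=9999)

# Option 1: honour the array's own fill_value (what the DataFrame output shows).
with_fill_value = price.filled()

# Option 2: ignore fill_value and use NaN (what the Series output shows).
with_nan = price.filled(np.nan)

print(with_fill_value)
print(with_nan)
```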