Pandas ENH: convert masked arrays for Series

Problem description

When a Series is constructed from a masked float32 numpy array, calling mean() on a resample produces NaNs, even though no values are masked. This doesn't occur with masked float64 arrays or with non-masked float32 arrays. Some operations, like first(), work, while median() raises a ValueError.

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd


# Same data and an all-False mask; only the dtype differs.
arr32 = np.ma.array([1.0, 2.0, 3.0], mask=[False, False, False], dtype='float32')
arr64 = np.ma.array([1.0, 2.0, 3.0], mask=[False, False, False], dtype='float64')
index = pd.date_range(start='2018-03-01 12:00:00Z', end='2018-03-01 12:10:00Z',
                      freq='5min')

ser32 = pd.Series(arr32, index=index)
ser64 = pd.Series(arr64, index=index)

print('float32 masked array')
print(ser32.resample('5min').mean())
print(ser32.resample('10min').mean())

print('float64 masked array')
print(ser64.resample('5min').mean())
print(ser64.resample('10min').mean())

print('non-masked float32')
print(pd.Series(arr32.data, index=index).resample('5min').mean())

# Raises ValueError on pandas 0.22.0 (traceback below).
ser32.resample('5min').median()

which outputs

float32 masked array                                                           
2018-03-01 12:00:00+00:00   NaN                                                
2018-03-01 12:05:00+00:00   NaN                                                
2018-03-01 12:10:00+00:00   NaN                                                
Freq: 5T, dtype: float32                                                       
2018-03-01 12:00:00+00:00   NaN                                                
2018-03-01 12:10:00+00:00   NaN                                                
Freq: 10T, dtype: float32                                                      
float64 masked array                                                           
2018-03-01 12:00:00+00:00    1.0                                               
2018-03-01 12:05:00+00:00    2.0                                               
2018-03-01 12:10:00+00:00    3.0                                               
Freq: 5T, dtype: float64                                                       
2018-03-01 12:00:00+00:00    1.5                                               
2018-03-01 12:10:00+00:00    3.0                                               
Freq: 10T, dtype: float64                                                      
non-masked float32                                                             
2018-03-01 12:00:00+00:00    1.0                                               
2018-03-01 12:05:00+00:00    2.0                                               
2018-03-01 12:10:00+00:00    3.0                                               
Freq: 5T, dtype: float32 

Traceback (most recent call last):
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 1145, in median
return self._cython_agg_general('median', **kwargs)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 921, in _cython_agg_general
min_count=min_count)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 2314, in aggregate
min_count=min_count)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 2242, in _cython_operation
values = _ensure_float64(values)
File "pandas/_libs/algos_common_helper.pxi", line 3182, in pandas._libs.algos.ensure_float64
File "pandas/_libs/algos_common_helper.pxi", line 3187, in pandas._libs.algos.ensure_float64
TypeError: astype() got an unexpected keyword argument 'copy'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/nanops.py", line 128, in f
result = alt(values, axis=axis, skipna=skipna, **kwds)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/nanops.py", line 386, in nanmedian
values = values.ravel()
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/numpy/ma/core.py", line 4532, in ravel
r._mask = ndarray.ravel(self._mask, order=order).reshape(r.shape)
ValueError: cannot reshape array of size 0 into shape (1,)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "bad_pandas.py", line 26, in <module>
ser32.resample('5min').median()
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/resample.py", line 621, in f
return self._downsample(_method)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/resample.py", line 773, in _downsample
self.grouper, axis=self.axis).aggregate(how, **kwargs)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 3121, in aggregate
return getattr(self, func_or_funcs)(*args, **kwargs)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 1156, in median
return self._python_agg_general(f)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 939, in _python_agg_general
result, counts = self.grouper.agg_series(obj, f)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 2591, in agg_series
return grouper.get_result()
File "pandas/_libs/src/reduce.pyx", line 279, in pandas._libs.lib.SeriesBinGrouper.get_result
File "pandas/_libs/src/reduce.pyx", line 265, in pandas._libs.lib.SeriesBinGrouper.get_result
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 933, in <lambda>
f = lambda x: func(x, *args, **kwargs)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 1155, in f
return x.median(axis=self.axis, **kwargs)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/generic.py", line 7315, in stat_func
numeric_only=numeric_only)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/series.py", line 2577, in _reduce
return op(delegate, skipna=skipna, **kwds)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/nanops.py", line 77, in _f
return f(*args, **kwargs)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/nanops.py", line 131, in f
result = alt(values, axis=axis, skipna=skipna, **kwds)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/nanops.py", line 386, in nanmedian
values = values.ravel()
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/numpy/ma/core.py", line 4532, in ravel
r._mask = ndarray.ravel(self._mask, order=order).reshape(r.shape)
ValueError: cannot reshape array of size 0 into shape (1,)

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.9-300.fc27.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.22.0
pytest: 3.4.2
pip: 9.0.1
setuptools: 38.5.1
Cython: None
numpy: 1.14.2
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.0
pytz: 2018.3
blosc: 1.5.1
bottleneck: None
tables: None
numexpr: 2.6.4
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.2.5
pymysql: 0.8.0
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Comment From: jreback

So masked arrays are converted automatically for DataFrames, but I guess not for Series. We should just do this. A foreign ndarray like this doesn't have enough support to be a first-class object in pandas (not to mention it's too complex and, to be honest, not worth it; does anyone use masked arrays?)

So would take a PR to convert masked arrays for Series.
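Until such a conversion exists in the constructor, a user-side workaround can be sketched as follows. This is a minimal sketch, not the pandas implementation: `masked_to_series` is a hypothetical helper name, and it assumes a floating-point masked array (filling an integer array with NaN would fail). It relies only on `np.ma.filled`, which returns a plain ndarray, the case the report above shows working.

```python
import numpy as np
import pandas as pd


def masked_to_series(marr, index=None):
    """Hypothetical helper: build a Series from a float masked array."""
    # np.ma.filled returns a plain ndarray; masked slots become NaN.
    values = np.ma.filled(marr, np.nan)
    return pd.Series(values, index=index)


arr32 = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False], dtype="float32")
index = pd.date_range("2018-03-01 12:00:00", periods=3, freq="5min", tz="UTC")

ser32 = masked_to_series(arr32, index=index)
means = ser32.resample("5min").mean()
print(means)
```

The masked middle value ends up as NaN in the Series, so pandas' usual missing-data handling applies from that point on.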

Comment From: dsm054

This seems to work for me even in pandas 0.22.0 with numpy >= 1.15.1. Maybe something changed which (unintentionally) handled this case?

Comment From: arw2019

This works on 1.2 master (due to #24581 and follow-ons).

There are tests in pandas/tests/frame/test_constructors. The tests don't use a datetime index but AFAICT this isn't the core issue here
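A regression check that does use a datetime index could look roughly like the sketch below. It is an illustration only, not one of the existing pandas tests: the function name is hypothetical, and it assumes a pandas version in which the Series constructor converts masked arrays, as reported in this thread.

```python
import numpy as np
import pandas as pd


def check_masked_float32_resample():
    # Hypothetical check: masked float32 data on a datetime index.
    arr = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False], dtype="float32")
    index = pd.date_range("2018-03-01 12:00:00", periods=3, freq="5min", tz="UTC")
    ser = pd.Series(arr, index=index)

    # The masked entry should surface as a missing value.
    assert ser.isna().tolist() == [False, True, False]

    # Unmasked entries should survive resample().mean() unchanged.
    result = ser.resample("5min").mean()
    assert result.iloc[0] == 1.0
    assert np.isnan(result.iloc[1])
    assert result.iloc[2] == 3.0
    return result


result = check_masked_float32_resample()
```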

Comment From: jbrockmendel

Another difference between the Series and DataFrame behavior with numpy masked arrays is what we do with the fill value:

import numpy as np
import pandas as pd
from numpy.ma import mrecords

mask = [(True, False), (False, True), (False, False), (False, True), (False, False)]
data = np.ma.array(np.ma.zeros(5, dtype=[("date", "<f8"), ("price", "<f8")]), mask=mask, fill_value=9999)

recs = data.view(mrecords.mrecarray)

df = pd.DataFrame(recs)

sers = {name: pd.Series(recs[name]) for name in recs.dtype.names}
expected = pd.DataFrame(sers)

>>> df
     date   price
0  9999.0     0.0
1     0.0  9999.0
2     0.0     0.0
3     0.0  9999.0
4     0.0     0.0

>>> expected
   date  price
0   NaN    0.0
1   0.0    NaN
2   0.0    0.0
3   0.0    NaN
4   0.0    0.0

That is, with the mrecords we fill masked entries with the array's fill_value, whereas for Series we ignore fill_value and use NaN. This happens because for Series we go through sanitize_masked_array, while for MaskedRecords we go through fill_masked_arrays.

Easy to make these match, just need to decide which is "right"
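The two candidate behaviors can be reproduced with numpy alone. This is a small sketch under the assumption that a single masked float column stands in for the `price` field above; it does not call the pandas internals named in this comment.

```python
import numpy as np

# One masked float column, loosely modelled on the "price" field above.
price = np.ma.array([0.0, 0.0, 0.0], mask=[False, True, False], fill_value=9999)

# Option A: substitute the array's own fill_value (the DataFrame result above).
with_fill_value = price.filled()

# Option B: ignore fill_value and mark masked slots as NaN (the Series result above).
with_nan = price.filled(np.nan)

print(with_fill_value)
print(with_nan)
```

Option A keeps a sentinel that later computations will treat as real data, while option B hands the gap to pandas' missing-value handling; which one is "right" is the open question here.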
