Code Sample
import pandas as pd

df = pd.DataFrame(columns=['a', 'b'])

def foo(row):
    return True

def bar(row):
    row['a']
    return True

t = df.apply(bar, axis=1)
print(type(t))
t = df.apply(foo, axis=1)
print(type(t))
Problem description
Applying different functions to the same empty DataFrame returns different types:
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
That's weird, since the only difference is the row['a'] expression, which is never executed.
When df is not empty, both return pd.Series.
Tested on pandas 1.19.1 and the latest version, 0.20.2.
Expected Output
I expected it to always return pd.Series.
Output of pd.show_versions()
2017-06-07 19:52:12 [pip.vcs] DEBUG: Registered VCS backend: git
2017-06-07 19:52:13 [pip.vcs] DEBUG: Registered VCS backend: hg
2017-06-07 19:52:13 [pip.vcs] DEBUG: Registered VCS backend: svn
2017-06-07 19:52:13 [pip.vcs] DEBUG: Registered VCS backend: bzr
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Darwin
OS-release: 16.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: zh_CN.UTF-8
LOCALE: None.None
pandas: 0.20.2
pytest: None
pip: 7.1.0
setuptools: 18.0.1
Cython: None
numpy: 1.12.1
scipy: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 3.6.0
bs4: None
html5lib: None
sqlalchemy: 0.9.10
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None
Comment From: TomAugspurger
I suspect that bar raises a KeyError, which is caught inside .apply and sends it down a different code path. You're welcome to take a look at what's going on.
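One way to check this suspicion without going through .apply is to call both functions directly on an empty Series. This is only a sketch: the empty object-dtype Series below is an assumed stand-in for whatever .apply passes to the function on an empty frame, and probe is a hypothetical helper, not part of pandas.

```python
import pandas as pd

def foo(row):
    return True

def bar(row):
    row['a']  # label lookup; fails if 'a' is not in the index
    return True

# Assumption: an empty Series approximates the input used on an empty frame.
empty = pd.Series([], dtype=object)

def probe(func):
    """Call func on the empty Series and report the result or the exception name."""
    try:
        return repr(func(empty))
    except Exception as exc:
        return type(exc).__name__

foo_result = probe(foo)
bar_result = probe(bar)
print(foo_result, bar_result)
```

If bar reports KeyError here while foo returns normally, that is consistent with the two calls taking different code paths.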
Comment From: jreback
Agree with @TomAugspurger; it might be excepting out of the inner loop. @gzcf, if you want to investigate and see if you can make a fix that passes the test suite, that would be fine.
Comment From: gzcf
I'm glad to help. Let me take some time to fix it.
Comment From: gzcf
After reading the related code, I found that this is intended behavior. See issue #2476 and _apply_empty_result in 'pandas/core/frame.py'.
def _apply_empty_result(self, func, axis, reduce, *args, **kwds):
    if reduce is None:
        reduce = False
        try:
            reduce = not isinstance(func(_EMPTY_SERIES, *args, **kwds),
                                    Series)
        except Exception:
            pass

    if reduce:
        return Series(NA, index=self._get_agg_axis(axis))
    else:
        return self.copy()
This code tries to guess the return type by calling func on an empty Series. I don't think this is a good implementation: catching Exception is bad because it swallows all exceptions. In many cases, calling func with an empty Series will raise KeyError. But I am new to the pandas source, so I am not sure what to do next.
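The guessing step can be reproduced in isolation to see why foo and bar end up with different result types. This is a minimal sketch, not the pandas implementation: EMPTY is an assumed stand-in for the internal _EMPTY_SERIES, and guess_result_type is a hypothetical helper that only mirrors the try/except logic quoted above.

```python
import pandas as pd

# Assumption: stand-in for pandas' internal _EMPTY_SERIES.
EMPTY = pd.Series([], dtype=object)

def guess_result_type(func):
    """Mirror the quoted logic: probe func with an empty Series to pick a result type."""
    reduce = False
    try:
        # A non-Series return value means the function "reduces" each row.
        reduce = not isinstance(func(EMPTY), pd.Series)
    except Exception:
        pass  # any error is swallowed, so reduce silently stays False
    return "Series" if reduce else "DataFrame"

def foo(row):
    return True

def bar(row):
    row['a']  # raises on the empty probe, so the guess falls back to DataFrame
    return True

foo_guess = guess_result_type(foo)
bar_guess = guess_result_type(bar)
print(foo_guess, bar_guess)
```

Under these assumptions, the swallowed exception in bar is the only reason the two guesses differ, which matches the types reported in the problem description.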
There are a few options:
- Default to Series without guessing the type, perhaps emitting a warning as well
- Default to DataFrame...
- Don't change this behavior
- Let it fail and raise the exception to the user. This was the original behavior.
Please give me some advice.
Comment From: Mega-Tom
I think the correct resolution is to use a Series with the data types from the DataFrame instead of _EMPTY_SERIES. Using simple default values for each data type should make it much less likely to raise an exception.
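A rough sketch of that idea follows. It is not a proposed patch: make_probe_row is a hypothetical helper, the choice of a zero value as the "simple default" is an assumption, and it ignores cases such as duplicate column names.

```python
import numpy as np
import pandas as pd

def make_probe_row(df):
    """Build one placeholder row labelled by df's columns (hypothetical helper)."""
    defaults = {}
    for col, dtype in df.dtypes.items():
        if isinstance(dtype, np.dtype):
            # Assumption: a zero-initialised value of the column dtype is a usable default.
            defaults[col] = np.zeros(1, dtype=dtype)[0]
        else:
            # Extension dtypes: fall back to None in this sketch.
            defaults[col] = None
    return pd.Series(defaults, dtype=object)

def bar(row):
    row['a']
    return True

df = pd.DataFrame(columns=['a', 'b'])
probe_row = make_probe_row(df)

# The probe row carries the column labels, so the label lookup in bar succeeds.
bar_value = bar(probe_row)
reduce = not isinstance(bar_value, pd.Series)
print(list(probe_row.index), bar_value, reduce)
```

With a probe row like this, bar no longer fails on the label lookup, so the type guess for foo and bar would agree. Functions that depend on the actual values could still raise or behave differently on placeholder data.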