Code Sample, a copy-pastable example if possible

from datetime import datetime

import pandas as pd

test_json = [{"_id": 'a', 'date': datetime.now()}, {"_id": 'b', 'date': datetime.now()}]
test_df = pd.DataFrame(test_json)

new_df = test_df.copy()
new_df["date"] = None
new_df.update(test_df)

print(test_df.head())
print(new_df.head())

Problem description

When using the update function with datetime data, the datetimes are automatically converted to integer timestamps, which seems like abnormal behaviour to me. The code above outputs:

  _id                       date
0   a 2019-11-07 15:50:06.072158
1   b 2019-11-07 15:50:06.072158
  _id                 date
0   a  1573141806072158000
1   b  1573141806072158000

Expected Output

  _id                       date
0   a 2019-11-07 15:50:06.072158
1   b 2019-11-07 15:50:06.072158
  _id                       date
0   a 2019-11-07 15:50:06.072158
1   b 2019-11-07 15:50:06.072158

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.4.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
machine          : AMD64
processor        : Intel64 Family 6 Model 61 Stepping 4, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.None

pandas           : 0.25.1
numpy            : 1.16.4
pytz             : 2019.2
dateutil         : 2.8.0
pip              : 19.2.2
setuptools       : 41.0.1
Cython           : 0.29.13
pytest           : None
hypothesis       : None
sphinx           : 2.1.2
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.4.1
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.10.1
IPython          : 7.7.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.1.1
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
s3fs             : None
scipy            : 1.3.1
sqlalchemy       : None
tables           : None
xarray           : None
xlrd             : None
xlwt             : None

Comment From: susan-shu-c

Hi, I was able to reproduce this result. It happens because pandas.DataFrame.update calls expressions.where (source link).

From there it eventually calls numpy.where (documentation), which in turn uses the NumPy MaskedArray type (source link).

Using numpy.where here seems to be a deliberate choice: it converts the datetime type to Unix time, which speeds up the computation. Feel free to correct me on that, though.
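As a quick sanity check on the Unix-time claim (a minimal sketch of my own, not from the pandas source): viewing one of the datetimes from the output above as an integer reproduces the reported nanosecond value exactly.

import pandas as pd

# Timestamp.value is nanoseconds since the Unix epoch, which matches
# the integers that update() produced in the output above.
ts = pd.Timestamp("2019-11-07 15:50:06.072158")
print(ts.value)  # 1573141806072158000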

I'd suggest trying pandas.to_datetime (linked here) to convert them back afterward (sometimes you have to reduce the Unix-time precision, by removing digits from the end, to get it to work). I haven't tested it on your example data yet, so feel free to try it; a sketch follows below.
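For example, a minimal sketch of that workaround, applied to the new_df from the code sample above (untested on the original data, and it assumes the integers are nanoseconds since the epoch, as the output suggests):

import pandas as pd

# Convert the integer nanosecond timestamps back into datetimes.
# unit="ns" assumes nanosecond precision; adjust if your values differ.
new_df["date"] = pd.to_datetime(new_df["date"], unit="ns")
print(new_df.head())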

Comment From: rhshadrach

Indeed, this appears to be an odd interaction with DatetimeArray and np.where.

from datetime import datetime

import numpy as np
import pandas as pd

a = np.asarray([None], dtype=object)
b = np.asarray(pd.arrays.DatetimeArray(pd.Series([datetime.now()])))  # datetime64[ns] array
cond = [False]
print(np.where(cond, a, b))

gives [1595782471507905000]; whereas

a = np.asarray([None], dtype=object)
b = np.asarray([datetime.now()], dtype=object)  # object array, not datetime64
cond = [False]
print(np.where(cond, a, b))

gives [datetime.datetime(2020, 7, 26, 16, 53, 4, 806281)]

Comment From: jbrockmendel

I see the expected behavior on main (it looks like an np.where call got changed to use Series.where somewhere along the line). This could use a test (first check to see if one already exists).
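A minimal sketch of such a regression test (the test name and plain asserts are illustrative, not an existing pandas test):

from datetime import datetime

import pandas as pd

def test_update_preserves_datetimes():
    # Regression test sketch for DataFrame.update turning datetimes
    # into integer nanoseconds (see the report above).
    df = pd.DataFrame({"_id": ["a", "b"],
                       "date": [datetime(2019, 11, 7, 15, 50, 6),
                                datetime(2019, 11, 7, 15, 50, 6)]})
    result = df.copy()
    result["date"] = None
    result.update(df)
    # The values should come back as datetimes, not integers.
    assert list(result["date"]) == list(df["date"])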

Comment From: takesanocean

Not reproducible for me either. I will check whether a test for this already exists and will create one if none does.

take