UPDATE:
In [157]: problem_date = old.dob.loc[4231354]
In [158]: problem_date
Out[158]: datetime.date(2939, 6, 2)
In [159]: test_series = pd.Series([problem_date])
In [160]: pd.to_datetime(test_series)
Out[160]:
0 2939-06-02
dtype: object
It seems this is the source of the problem. I think there may be other dates in my dataset that are breaking the to_datetime method.
UPDATE 2:
Maybe the problem is that the date is later than 2900?
In [194]: pd.to_datetime(old.dob[8230866])
Out[194]: datetime.date(2955, 8, 22)
In [195]: another_bad_date = old.dob.loc[8230866]
In [196]: pd.to_datetime(pd.Series([another_bad_date]))
Out[196]:
0 2955-08-22
dtype: object
original issue:
The column in question came from a read_sql query, and the column has datetimes. It consists solely of pandas datetime objects and NoneType objects. I have iterated over the Series to be sure. The column has 11 million rows.
I've tried casting with to_datetime (and the dtype remains object--shouldn't the dtype change after that call?), to no avail.
Here's some stuff I get from poking around after sticking an import pdb; pdb.set_trace() into line 3329 of pytables.py (after except (NotImplementedError, ValueError, TypeError) as e:):
(Pdb) b
(Pdb) i
3
(Pdb) blocks[3]
ObjectBlock: [1, 2, 3, 4, 9, 12, 13, 14], 8 x 8255524, dtype: object
(Pdb) blk_items[3]
Index([u'dob', u'City', u'Region', u'Zip', u'lang', u'UnsubscribedDate', u'BadAddressDate', u'ISP'], dtype='object')
(Pdb) existing_col
(Pdb) col
name->values_block_3,cname->values_block_3,dtype->None,shape->None
(Pdb) b
(Pdb) type(b)
<class 'pandas.core.internals.ObjectBlock'>
(Pdb) block_items
*** NameError: name 'block_items' is not defined
(Pdb) b_items
Index([u'dob', u'City', u'Region', u'Zip', u'lang', u'UnsubscribedDate', u'BadAddressDate', u'ISP'], dtype='object')
(Pdb) existing_col
(Pdb) e
TypeError('Cannot serialize the column [dob] because\nits data contents are [mixed] object dtype',)
(Pdb) type(col)
<class 'pandas.io.pytables.DataCol'>
(Pdb) lib
<module 'pandas.lib' from '/home/mmccrea/anaconda/lib/python2.7/site-packages/pandas/lib.so'>
My debugging kind of hits a wall here, because it seems infer_dtype is throwing the error, and that lives in lib.so, a compiled binary that I'm not sure how to look into. I would love a suggestion about how to deal with that in the future, in addition to some answers about what's going on in this case.
Comment From: jreback
see here: http://pandas.pydata.org/pandas-docs/stable/gotchas.html#minimum-and-maximum-timestamps
these are out of range of the high performance datetime impl, so these revert to object dtypes.
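For context, the limits on that page follow from storing each timestamp as a signed 64-bit count of nanoseconds since 1970-01-01. Below is a minimal stdlib-only sketch of that arithmetic. The names EPOCH, NS_MAX, UPPER, LOWER, and fits_ns are illustrative and not part of pandas, and the bounds are approximate because datetime only resolves microseconds.

```python
from datetime import date, datetime, timedelta

EPOCH = datetime(1970, 1, 1)
NS_MAX = 2**63 - 1  # largest signed 64-bit integer, read as nanoseconds

# datetime cannot hold nanoseconds, so truncate to whole microseconds.
UPPER = EPOCH + timedelta(microseconds=NS_MAX // 1000)
LOWER = EPOCH - timedelta(microseconds=NS_MAX // 1000)


def fits_ns(d):
    """Return True if a date lies inside the approximate nanosecond range."""
    as_dt = datetime(d.year, d.month, d.day)
    return LOWER <= as_dt <= UPPER


print(UPPER)                      # upper bound, in April 2262
print(fits_ns(date(2939, 6, 2)))  # the dob from the report: out of range
print(fits_ns(date(1990, 1, 1)))  # an ordinary date: in range
```

Running a check like fits_ns over an object column is one way to list the offending values before deciding how to store them.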
An alternative is to use Periods (though there is an open issue with storing these in HDF5. It's not difficult, just needs a bit of work, see here).
So this should raise ATM in HDF5. These cannot be serialized in table format at all (Object block is restricted to actual strings). I think fixed format might work.
That said, if you would like to work on the period repr, that would be great.
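As a rough illustration of the Periods suggestion, the sketch below converts two of the dates from the report into daily Periods. The helper to_period is a hypothetical name, not a pandas function, and the sketch only covers the in-memory conversion. It does not address HDF5 storage.

```python
from datetime import date

import pandas as pd

# Two of the out-of-range values from the report, plus a missing entry.
raw = [date(2939, 6, 2), date(2955, 8, 22), None]


def to_period(d):
    """Convert a date to a daily Period; missing values become NaT."""
    if d is None:
        return pd.NaT
    return pd.Period(year=d.year, month=d.month, day=d.day, freq="D")


periods = [to_period(d) for d in raw]
print(periods[0], periods[0].year)
```

Daily Periods are stored as an integer count of days rather than nanoseconds, which is why years such as 2939 stay representable.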
Comment From: rockg
Just curious, what are you doing that you need dates out to 2900?
Comment From: cowpig
OK, so I'm thinking that the first problem is that to_datetime ignores errors by default, and I'll put in a pull request to fix that.
I might look closer at the Periods thing later this week.
Comment From: cowpig
oh, and @rockg it's a database of time travelers (jk they're database errors)
Comment From: jbrockmendel
pandas now supports non-nano dt64, so the datetime objects in 2955 should be fine. can you confirm the pytables part of the issue is fixed?
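A small sketch of what non-nano support looks like in memory, assuming pandas 2.0 or later. It builds the Series from a NumPy array with second resolution and uses the two dates from the report. It does not test the pytables path, so it does not answer the question above.

```python
import numpy as np
import pandas as pd

# Assumption: pandas >= 2.0, which keeps the array's second resolution
# instead of converting to nanoseconds.
values = np.array(["2939-06-02", "2955-08-22"], dtype="datetime64[s]")
dob = pd.Series(values)

print(dob.dtype)
print(dob.dt.year.tolist())
```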