Pandas ERR: HDF5 serialization of datelike-object dtypes should raise

UPDATE:

In [157]: problem_date = old.dob.loc[4231354]

In [158]: problem_date
Out[158]: datetime.date(2939, 6, 2)

In [159]: test_series = pd.Series([problem_date])

In [160]: pd.to_datetime(test_series)
Out[160]: 
0    2939-06-02
dtype: object

It seems this is the source of the problem. I think there may be other dates in my dataset that are breaking the to_datetime method.

UPDATE 2:

Maybe the problem is that the date is later than 2900?

In [194]: pd.to_datetime(old.dob[8230866])                
Out[194]: datetime.date(2955, 8, 22)

In [195]: another_bad_date = old.dob.loc[8230866]         

In [196]: pd.to_datetime(pd.Series([another_bad_date]))
Out[196]: 
0    2955-08-22
dtype: object

original issue:

The column in question came from a read_sql query, and the column has datetimes. It consists solely of pandas datetime objects and NoneType objects. I have iterated over the Series to be sure. The column has 11 million rows.

I've tried casting with to_datetime (and the dtype remains object--shouldn't the dtype change after that call?), to no avail.

Here's some stuff I get from poking around after sticking an import pdb; pdb.set_trace() into line 3329 of pytables.py (after except (NotImplementedError, ValueError, TypeError) as e:):

(Pdb) b

(Pdb) i

3

(Pdb) blocks[3]

ObjectBlock: [1, 2, 3, 4, 9, 12, 13, 14], 8 x 8255524, dtype: object

(Pdb) blk_items[3]

Index([u'dob', u'City', u'Region', u'Zip', u'lang', u'UnsubscribedDate', u'BadAddressDate', u'ISP'], dtype='object')

(Pdb) existing_col

(Pdb) col

name->values_block_3,cname->values_block_3,dtype->None,shape->None

(Pdb) b

(Pdb) type(b)

<class 'pandas.core.internals.ObjectBlock'>

(Pdb) block_items

*** NameError: name 'block_items' is not defined

(Pdb) b_items

Index([u'dob', u'City', u'Region', u'Zip', u'lang', u'UnsubscribedDate', u'BadAddressDate', u'ISP'], dtype='object')

(Pdb) existing_col

(Pdb) e

TypeError('Cannot serialize the column [dob] because\nits data contents are [mixed] object dtype',)

(Pdb) type(col)

<class 'pandas.io.pytables.DataCol'>

(Pdb) lib

<module 'pandas.lib' from '/home/mmccrea/anaconda/lib/python2.7/site-packages/pandas/lib.so'>

My debugging kind of hits a wall here, because it seems infer_dtype is throwing the error. That function lives in lib.so, which is a compiled binary, and I'm not sure how to look into it to figure out what's going on. I would love a suggestion about how to deal with that in the future, in addition to some answers about what's going on in this case.

Comment From: jreback

see here: http://pandas.pydata.org/pandas-docs/stable/gotchas.html#minimum-and-maximum-timestamps

these are out of range of the high performance datetime impl, so these revert to object dtypes.
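To make the range concrete, here is a rough sketch using only the standard library. It assumes, as the linked docs describe, that timestamps are stored as 64-bit integer nanoseconds counted from 1970-01-01. The helper name `fits_in_ns_range` is made up for illustration and is not a pandas function.

```python
from datetime import date, datetime, timedelta

# Assumption: timestamps are int64 nanoseconds since the Unix epoch.
EPOCH = datetime(1970, 1, 1)
INT64_MAX_NS = 2**63 - 1

# datetime only has microsecond resolution, so drop the last three digits.
span = timedelta(microseconds=INT64_MAX_NS // 1000)
UPPER = EPOCH + span  # falls in April 2262
LOWER = EPOCH - span  # falls in September 1677


def fits_in_ns_range(d: date) -> bool:
    """Hypothetical helper: True if midnight on `d` fits in the ns range."""
    return LOWER <= datetime(d.year, d.month, d.day) <= UPPER


print(UPPER.date(), LOWER.date())
print(fits_in_ns_range(date(2939, 6, 2)))  # a date from this issue
print(fits_in_ns_range(date(2000, 1, 1)))
```

So the cutoff is not the year 2900 but roughly 2262, and anything past it cannot be represented at nanosecond resolution.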

An alternative is to use Periods (though there is an open issue with storing these in HDF5; it's not difficult, just needs a bit of work, see here).
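A minimal sketch of the Period route, using the date reported above. It only shows that a daily Period can hold the value in memory and gets its own dtype; it does not touch HDF5, where storage is the open issue mentioned.

```python
import datetime

import pandas as pd

problem_date = datetime.date(2939, 6, 2)  # the out-of-range value from this issue

# Build a daily Period from the date's components.
period = pd.Period(
    year=problem_date.year,
    month=problem_date.month,
    day=problem_date.day,
    freq="D",
)

# A Series of Periods gets a dedicated period dtype rather than object.
periods = pd.Series([period])
print(periods.dtype)
print(period.year, period.month, period.day)
```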

So this should raise ATM in HDF5. These cannot be serialized in table format at all (Object block is restricted to actual strings). I think fixed format might work.

That said, if you would like to work on the Period repr, that would be great.

Comment From: rockg

Just curious, what are you doing that you need dates out to 2900?

Comment From: cowpig

OK, so I'm thinking that the first problem is that to_datetime ignores errors by default, and I'll put in a pull request to fix that.

I might look closer at the Periods thing later this week.

Comment From: cowpig

oh, and @rockg it's a database of time travelers (jk they're database errors)

Comment From: jbrockmendel

pandas now supports non-nano dt64, so the datetime objects in 2955 should be fine. can you confirm the pytables part of the issue is fixed?
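For anyone checking this, a small sketch of what the non-nano support looks like, assuming pandas 2.0 or later (older versions convert to nanoseconds and fail on these values). It builds the column at second resolution from a NumPy array. Whether `to_hdf` then accepts the column is the part that still needs confirming, so it is left out here.

```python
import numpy as np
import pandas as pd

# Second-resolution datetime64 covers far more than the years 1677-2262.
values = np.array(["2939-06-02", "2955-08-22"], dtype="datetime64[s]")
dob = pd.Series(values, name="dob")

print(dob.dtype)
print(dob.dt.year.tolist())
```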