srs1 holds 1000 random numbers to be binned using the boundaries in absths:
import numpy as np
import pandas as pd
srs1 = pd.Series(np.random.uniform(low=0, high=3e-17, size=(1000,)))
absths = np.array([0., 1.e-22, 1.e-18, 1.e-16])
Bin them, then print the boundaries and the results for the first 5 numbers in srs1:
ncut1 = pd.cut(srs1,
               bins=absths,
               include_lowest=True,
               precision=16,
               retbins=True)
print(ncut1[1])
print(ncut1[0][:5])
This gives:
[ 0.00000000e+00 1.00000000e-22 1.00000000e-18 1.00000000e-16]
0 (1.0000000000000001e-18, 9.9999999999999998e-17]
1 (1.0000000000000001e-18, 9.9999999999999998e-17]
2 (1.0000000000000001e-18, 9.9999999999999998e-17]
3 (1.0000000000000001e-18, 9.9999999999999998e-17]
4 (1.0000000000000001e-18, 9.9999999999999998e-17]
dtype: category
Categories (3, object): [[0, 1] < (1, 1.0000000000000001e-18] < (1.0000000000000001e-18, 9.9999999999999998e-17]]
The boundary that is meant to be 1e-22 is displayed as 1 in Categories. The keyword argument precision is already set to 16 to display many decimals. Is this a bug, or am I not using the function correctly?
Thanks
Comment From: jorisvandenbossche
Something seems to go wrong with the conversion of the bin edges to strings, I think. Given that we are reworking this to be based on IntervalIndex, this may be fixed by that. @jreback, this is maybe a case to test there in the PR? (https://github.com/pandas-dev/pandas/pull/15309)
Comment From: jreback
If you try with a higher precision, e.g. 22, this works, though I suspect you are actually hitting machine precision limits anyhow. Comparing numbers beyond 1e-15 can be somewhat arbitrary, so sometimes it will work, and if there are too many significant digits it might not. I would either add a doc note or raise if precision is too large here (in other words, > 15). Yes, this might work in #15309 because we are not stringifying but using actual values, but I think the same caveats apply.
So I'll mark this as a doc issue.
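As a rough illustration of the machine-precision point, here is a minimal sketch using only the standard library. It does not reproduce pandas' label formatting; the variable names are made up for illustration. It checks what float64 can represent at the magnitudes used in this issue.

```python
import sys

# float64 guarantees this many significant decimal digits survive a
# decimal -> float -> decimal round trip.
digits = sys.float_info.dig
print(digits)

# The bin edges from the report (illustrative copy, plain Python floats).
edges = [0.0, 1e-22, 1e-18, 1e-16]

# The edges are distinct, strictly increasing floats.
edges_are_ordered = all(a < b for a, b in zip(edges, edges[1:]))

# Formatting with 17 significant digits always round-trips a float64,
# although the string may show more digits than were typed in the source.
round_trips = all(float("%.17g" % e) == e for e in edges)

# Precision is lost when a tiny value is added to a much larger one.
absorbed = (1.0 + 1e-22 == 1.0)

print(edges_are_ordered, round_trips, absorbed)
```

Under these assumptions, the precision limit concerns the number of significant digits requested for the labels, not the small magnitude of the edges.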
Comment From: anujloomba
I was wondering if there was any resolution on this? I am facing a similar issue with labels.
Comment From: jorisvandenbossche
@anujloomba can you provide a reproducible example that shows your problem? The problem illustrated above is in fact mostly fixed on master (although the 0 is now not represented correctly).
Comment From: jorisvandenbossche
@qAp on master I now get:
In [8]: print(ncut1[1])
[ 0.00000000e+00 1.00000000e-22 1.00000000e-18 1.00000000e-16]
In [9]: print(ncut1[0][:5])
0 (1e-18, 1.0000000000000001e-16]
1 (1e-18, 1.0000000000000001e-16]
2 (1e-18, 1.0000000000000001e-16]
3 (1e-18, 1.0000000000000001e-16]
4 (1e-18, 1.0000000000000001e-16]
dtype: category
Categories (3, interval[float64]): [(-1e-16, 1.0000000000000002e-22] < (1.0000000000000002e-22, 1e-18] <
(1e-18, 1.0000000000000001e-16]]
This seems better, as it displays the 1e-22 correctly, although the left bound of 0 is now represented as -1e-16.
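If the displayed bounds matter, one possible workaround (a sketch, not something proposed in this thread) is to pass explicit labels to pd.cut so that no automatic formatting of the edges is involved. The sample values and the way the label strings are built here are assumptions for illustration.

```python
import numpy as np
import pandas as pd

absths = np.array([0., 1.e-22, 1.e-18, 1.e-16])

# Fixed sample values instead of random data, so the result is predictable.
srs = pd.Series([0.0, 5e-23, 5e-20, 5e-17])

# Build one label per bin directly from the raw edges (illustrative format).
labels = ["({:g}, {:g}]".format(lo, hi)
          for lo, hi in zip(absths[:-1], absths[1:])]
# With include_lowest=True the first bin is closed on the left.
labels[0] = "[" + labels[0][1:]

binned = pd.cut(srs, bins=absths, include_lowest=True, labels=labels)
print(binned)
```

The resulting categories are plain strings rather than intervals, so this trades interval-aware behaviour for full control over the display.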