Pandas DOC: floating point precision on writing/reading to csv

Code Sample

import pandas as pd

# Python 2 syntax, matching the interpreters listed below
x0 = 18292498239.824
df1 = pd.DataFrame({'One': x0},index=["bignum"])
df1.to_csv('repr_test.csv')
# Read the file back in two ways
df2 = pd.DataFrame.from_csv('repr_test.csv')
df3 = pd.read_csv('repr_test.csv')
x1 = df1['One'][0]
x2 = df2['One'][0]
x3 = df3['One'][0]
# Parse the value straight from the file, bypassing the pandas readers
fh = open('repr_test.csv','rb')
ll = fh.readlines()
fh.close()
x4 = float(ll[1].split(',')[1].split()[0])
print "x0 = %f; x1 = %f; Are they equal? %s" % (x0,x1,(x0 == x1))
print "x0 = %f; x2 = %f; Are they equal? %s" % (x0,x2,(x0 == x2))
print "x0 = %f; x3 = %f; Are they equal? %s" % (x0,x3,(x0 == x3))
print "x0 = %f; x4 = %f; Are they equal? %s" % (x0,x4,(x0 == x4))

Expected Output

x0 = 18292498239.824001; x1 = 18292498239.824001; Are they equal? True
x0 = 18292498239.824001; x2 = 18292498239.824001; Are they equal? True
x0 = 18292498239.824001; x3 = 18292498239.824001; Are they equal? True
x0 = 18292498239.824001; x4 = 18292498239.824001; Are they equal? True

output of `pd.show_versions()`

(Note that there are two, presented side-by-side, with results underneath)

INSTALLED VERSIONS                      INSTALLED VERSIONS
------------------                      ------------------
commit: None                            commit: None
python: 2.7.5.final.0                   python: 2.7.11.final.0
python-bits: 64                         python-bits: 64
OS: Linux                               OS: Linux
OS-release: 2.6.32-431.56.1.el6.x86_64  OS-release: 2.6.32-431.56.1.el6.x86_64
machine: x86_64                         machine: x86_64
processor: x86_64                       processor: x86_64
byteorder: little                       byteorder: little
LC_ALL: None                            LC_ALL: None
LANG: en_US.UTF-8                       LANG: en_US.UTF-8

pandas: 0.15.1                          pandas: 0.18.0
nose: 1.3.4                             nose: 1.3.7
Cython: 0.21.2                          Cython: 0.23.4
numpy: 1.9.1                            numpy: 1.10.4
scipy: 0.14.0                           scipy: 0.17.0                 
statsmodels: 0.6.0                      statsmodels: 0.6.1            
IPython: 2.3.0                          IPython: 4.1.2 
sphinx: 1.2.3                           sphinx: 1.3.5  
patsy: 0.3.0                            patsy: 0.4.0   
dateutil: 2.2                           dateutil: 2.5.1
pytz: 2014.9                            pytz: 2016.2   
bottleneck: None                        bottleneck: 1.0.0
tables: 3.1.1                           tables: 3.2.2    
numexpr: 2.4                            numexpr: 2.5     
matplotlib: 1.4.2                       matplotlib: 1.5.1
openpyxl: None                          openpyxl: 2.3.2  
xlrd: 0.9.3                             xlrd: 0.9.4      
xlwt: 0.7.5                             xlwt: 1.0.0      
xlsxwriter: 0.6.3                       xlsxwriter: 0.8.4
lxml: 3.3.3                             lxml: 3.6.0      
bs4: 4.3.2                              bs4: 4.4.1       
html5lib: None                          html5lib: None   
httplib2: None                          httplib2: None   
apiclient: None                         apiclient: None  
rpy2: None                              
sqlalchemy: None                        sqlalchemy: 1.0.12                                                    
pymysql: None                           pymysql: None 
psycopg2: None                          psycopg2: None
                                        pip: 8.1.1      
                                        xarray: None    
                                        setuptools: 20.3
                                        blosc: None     
                                        jinja2: 2.8     
                                        boto: 2.39.0

Results from left setup (0.15.1):

x0 = 18292498239.824001; x1 = 18292498239.824001; Are they equal? True
x0 = 18292498239.824001; x2 = 18292498239.823997; Are they equal? False
x0 = 18292498239.824001; x3 = 18292498239.823997; Are they equal? False
x0 = 18292498239.824001; x4 = 18292498239.824001; Are they equal? True

Results from right setup (0.18.0):

x0 = 18292498239.824001; x1 = 18292498239.824001; Are they equal? True
x0 = 18292498239.824001; x2 = 18292498239.799999; Are they equal? False
x0 = 18292498239.824001; x3 = 18292498239.799999; Are they equal? False
x0 = 18292498239.824001; x4 = 18292498239.799999; Are they equal? False

Expectations

I expect to be able to write a DataFrame to a CSV file and later read it into a new DataFrame such that the two DataFrames are identical. The older version (0.15.1) behaves considerably better than the newer one: there I can recover the expected values by rounding to three decimal places, or by reading from a filehandle instead of using from_csv() or read_csv(). The newer version (0.18.0) loses information in the file itself (even x4, parsed directly from the file, is wrong), which is not acceptable.

Note that the documentation at http://pandas.pydata.org/pandas-docs/version/0.18.1/generated/pandas.DataFrame.from_csv.html reads:

It is preferable to use the more powerful pandas.read_csv() for most general purposes, but from_csv makes for an easy roundtrip to and from a file (the exact counterpart of to_csv), especially with a DataFrame of time series data.

But this does not describe what actually happens, as demonstrated above.

Comment From: sinhrks

Specify required precision via float_format.

df1.to_csv('repr_test.csv', float_format='%.6f')
df2 = pd.DataFrame.from_csv('repr_test.csv')
df2.iloc[0, 0]
# 18292498239.824001

Maybe the docs should have a float_format section (for output), as they do for float_precision (for input): http://pandas.pydata.org/pandas-docs/stable/io.html#specifying-method-for-floating-point-conversion
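For illustration, here is a minimal sketch of how the two options mentioned above can be combined. It assumes a reasonably recent pandas in which read_csv accepts float_precision='round_trip'; the helper name roundtrip is made up for this example and is not part of pandas.

```python
import io

import pandas as pd

x0 = 18292498239.824
df = pd.DataFrame({"One": [x0]}, index=["bignum"])


def roundtrip(float_format=None, float_precision=None):
    """Write df to an in-memory CSV and read the single value back (hypothetical helper)."""
    buf = io.StringIO()
    df.to_csv(buf, float_format=float_format)
    buf.seek(0)
    back = pd.read_csv(buf, index_col=0, float_precision=float_precision)
    return back["One"].iloc[0]


# One decimal place is too few for this value: digits are lost on write.
lossy = roundtrip(float_format="%.1f")

# 17 significant digits on write plus the round-trip parser on read.
exact = roundtrip(float_format="%.17g", float_precision="round_trip")

print(lossy == x0)
print(exact == x0)
```

The format "%.17g" is chosen here because 17 significant digits are enough to identify any 64-bit float; a fixed number of decimals such as "%.6f" happens to work for this particular value but not for every magnitude.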

Comment From: jreback

Yes, this is a tradeoff between reading speed and exactness out to a certain ULP. As @sinhrks indicated, for reading we offer a higher-precision option; writing is subject to the vagaries of float-to-string conversion.
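To put a number on "a certain ULP" for the value in this report, the sketch below uses only the standard library (math.ulp and math.nextafter need Python 3.9 or later). The 0.15.1 results above differ from x0 by about 4e-6, which is consistent with an error of roughly one ULP at this magnitude; that reading of the results is an inference, not something confirmed in this thread.

```python
import math

x0 = 18292498239.824

# Spacing between adjacent 64-bit floats in the neighbourhood of x0.
ulp = math.ulp(x0)

# The closest representable value below x0.
below = math.nextafter(x0, 0.0)

print(ulp)          # 3.814697265625e-06, i.e. 2**-18
print(x0 - below)   # the same spacing
print(below == x0)  # False: a distinct float one ULP away
```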

Comment From: kawochen

I think writing should have something similar to float_precision, since the round-trip-ability is based mostly on the number of significant digits, not the number of digits after the decimal point.
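A small standard-library sketch of that point: a fixed number of decimals can be enough for a large value and far too few for a small one, while 17 significant digits work for both. The sample value small is made up for illustration.

```python
large = 18292498239.824
small = 1.2345678901234567e-05  # hypothetical small value

# Six digits after the decimal point.
large_fixed = "%.6f" % large
small_fixed = "%.6f" % small    # '0.000012'

# Seventeen significant digits.
large_sig = "%.17g" % large
small_sig = "%.17g" % small

print(float(large_fixed) == large)  # True
print(float(small_fixed) == small)  # False: most digits were dropped
print(float(large_sig) == large)    # True
print(float(small_sig) == small)    # True
```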

I haven't looked at the code, but the difference here seems to be related to defaulting to __str__() vs __repr__() on Python 2. __repr__() has enough digits for round-trip.
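The following sketch emulates that difference on Python 3. It assumes that Python 2's str() formatted floats with 12 significant digits (equivalent to "%.12g"); whether pandas 0.18.0 actually went through that code path is not established in this thread.

```python
x0 = 18292498239.824

# Emulate Python 2 str(): 12 significant digits (assumption, see above).
py2_style_str = "%.12g" % x0

# repr() yields the shortest string that parses back to the same float.
round_trip_repr = repr(x0)

print(py2_style_str)                 # 18292498239.8
print(round_trip_repr)               # 18292498239.824
print(float(py2_style_str) == x0)    # False
print(float(round_trip_repr) == x0)  # True
```

The truncated string matches the shape of the 0.18.0 results above, where the value read back is close to 18292498239.8 rather than 18292498239.824.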

Comment From: BlGene

Please also consider the case where different columns have different rounding levels.
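One possible workaround for per-column rounding, sketched under assumptions: the column names, values, and the formats mapping below are invented for the example, and converting columns to strings before writing is just one approach, not an established pandas recommendation.

```python
import io

import pandas as pd

# Hypothetical data: each column needs a different output precision.
df = pd.DataFrame({"price": [1234.5678, 0.5], "ratio": [0.123456789, 2.0]})
formats = {"price": "{:.2f}", "ratio": "{:.6g}"}  # hypothetical per-column formats

# Format each column to strings first, then write the strings unchanged.
out = df.copy()
for col, fmt in formats.items():
    out[col] = out[col].map(fmt.format)

buf = io.StringIO()
out.to_csv(buf, index=False)
csv_lines = buf.getvalue().splitlines()
print(csv_lines)
```

Note that columns rounded this way are no longer guaranteed to round-trip exactly; the rounding is a deliberate loss of digits.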

Comment From: dhavide

take

Comment From: hualiu01

@dhavide Is this issue resolved? Can I take this issue?

Comment From: hualiu01

take

Comment From: hualiu01

Tested the original example with Python 3 (3.10.14); the printed values are as expected.

Specifically, see code:

import pandas as pd

WORKDIR = '../tmp'

x0 = 18292498239.824
df1 = pd.DataFrame({'One': x0},index=["bignum"])
df1.to_csv(f'{WORKDIR}/repr_test.csv')
# df2 = pd.DataFrame.from_csv('repr_test.csv')
df3 = pd.read_csv(f'{WORKDIR}/repr_test.csv')
x1 = df1['One'].loc[df1.index[0]]
# x2 = df2['One'][0]
x3 = df3['One'].loc[df3.index[0]]
with open(f'{WORKDIR}/repr_test.csv','rb') as fh:
    ll = fh.readlines()

# x4 = float(ll[1].split(',')[1].split()[0])
x4 = float(ll[1].decode().split(',')[1].split()[0])

print(f"x0 = {x0}; x1 = {x1}; Are they equal? {x0 == x1}")
# print(f"x0 = {x0}; x2 = {x2}; Are they equal? {x0 == x2}")
print(f"x0 = {x0}; x3 = {x3}; Are they equal? {x0 == x3}")
print(f"x0 = {x0}; x4 = {x4}; Are they equal? {x0 == x4}")

output

x0 = 18292498239.824; x1 = 18292498239.824; Are they equal? True
x0 = 18292498239.824; x3 = 18292498239.824; Are they equal? True
x0 = 18292498239.824; x4 = 18292498239.824; Are they equal? True

Comment From: gumus-g

Hi @hualiu01, just checking in on this issue — is it still actively being worked on? If not, I’d love to help with improving the documentation around floating-point precision for CSV I/O. Happy to build on the test cases already explored and draft a docs update if that would be helpful!

Comment From: dhavide

@gumus-g , I do have some notes (containing links to relevant sources) that I compiled on this a while back. I wanted to determine under what circumstances it is reasonable to expect round-trip conversion from CSV (i.e., decimal) to binary floating-point and back. Basically, there are specific conditions (i.e., inequalities) relating the precisions in decimal and binary that specify when you can reasonably expect a round-trip conversion to work exactly.
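The notes mentioned above are not available in this thread, so the sketch below only states the commonly cited textbook conditions and may differ from what those notes contain. For a binary format with p significand bits: a decimal string of n significant digits survives text -> float -> text when 10**n < 2**(p-1), and a float survives float -> text -> float when it is written with n digits such that 10**(n-1) > 2**p.

```python
import sys

p = sys.float_info.mant_dig  # significand bits of a Python float (53 for IEEE double)

# Largest n such that every n-digit decimal survives text -> float -> text.
digits_text_safe = max(n for n in range(1, 40) if 10 ** n < 2 ** (p - 1))

# Smallest n such that every float survives float -> text -> float.
digits_float_safe = next(n for n in range(1, 40) if 10 ** (n - 1) > 2 ** p)

print(p, digits_text_safe, digits_float_safe)
print(sys.float_info.dig)  # the interpreter's own value for the first bound
```

For 64-bit floats this gives 15 and 17 digits, which is why "%.17g" is a common choice when exact round-trips matter.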

Unfortunately, I'm travelling and will not have access to the old laptop with those notes on it until I return home (~Aug. 1). If you do take this issue from @hualiu01, I will share my notes with you as soon as I have them.

Comment From: gumus-g

Thanks, @dhavide — I’d definitely be interested in reviewing those notes once you’re back. The conditions around round-trip fidelity between decimal and binary representations sound like exactly the kind of nuance that would strengthen the docs. I’ll hold off on any final edits until I’ve had a chance to incorporate your sources. In the meantime, I’ll continue exploring the current behavior and edge cases, and will keep this thread updated if I draft anything preliminary. Safe travels, and looking forward to syncing up after Aug. 1!
