Reopening issue #14530. The close description is incorrect. The JSON specification explicitly leaves limits on numbers to implementations rather than defining them in the specification itself.
From https://tools.ietf.org/html/rfc7159#section-6
This specification allows implementations to set limits on the range and precision of numbers accepted
The standard json library in Python supports large numbers, meaning the language itself supports JSON with these values:
Python 3.6.8 (default, Apr 7 2019, 21:09:51)
[GCC 5.3.1 20160406 (Red Hat 5.3.1-6)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> j = """{"id": 9253674967913938907}"""
>>> json.loads(j)
{'id': 9253674967913938907}
Loading a JSON file with large integers (larger than 2^63, the int64 maximum) results in "Value is too big". I have tried changing the orient to "records" and also passing in dtype={'id': numpy.dtype('uint64')}. The error is the same.
import pandas
data = pandas.read_json('''{"id": 10254939386542155531}''')
print(data.describe())
Expected Output
                           id
count                       1
unique                      1
top     10254939386542155531
freq                        1
Actual Output (even with dtype passed in)
File "./parse_dispatch_table.py", line 34, in <module> print(pandas.read_json('''{"id": 10254939386542155531}''', dtype=dtype_conversions).describe()) File "/users/XXX/.local/lib/python3.4/site-packages/pandas/io/json.py", line 234, in read_json date_unit).parse() File "/users/XXX/.local/lib/python3.4/site-packages/pandas/io/json.py", line 302, in parse self._parse_no_numpy() File "/users/XXX/.local/lib/python3.4/site-packages/pandas/io/json.py", line 519, in _parse_no_numpy loads(json, precise_float=self.precise_float), dtype=None) ValueError: Value is too big
No problem using read_csv:
import pandas
import io
print(pandas.read_csv(io.StringIO('''id\n10254939386542155531''')).describe())
Output using read_csv
                           id
count                       1
unique                      1
top     10254939386542155531
freq                        1
Output of pd.show_versions()
Comment From: artdgn
Workaround:
A workaround that worked for me is to patch the loads function used by read_json. You can patch it with the standard Python json module, or with simplejson instead. E.g. if you monkeypatch pandas like this:
import pandas as pd
import pandas.io.json
# monkeypatch using standard python json module
import json
pd.io.json._json.loads = lambda s, *a, **kw: json.loads(s)
# monkeypatch using faster simplejson module
import simplejson
pd.io.json._json.loads = lambda s, *a, **kw: simplejson.loads(s)
# normalising (unnesting) at the same time (for nested jsons)
pd.io.json._json.loads = lambda s, *a, **kw: pandas.io.json.json_normalize(simplejson.loads(s))
After this patch, read_json() should work on the example above (pandas.read_json('''{"id": 10254939386542155531}''', orient='index')) and on the JSON that was giving me trouble.
Suggestion for future solution:
A possible solution for pandas is to provide a parameter via which one can override (inject) the loads function when calling read_json(), e.g. read_json(data, *args, loads_fn=None, **kwargs). If the maintainers think that's a viable solution, I'd be happy to submit a PR.
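To illustrate the idea, here is a minimal sketch of the same injection done outside pandas today (read_json_py and loads_fn are hypothetical names, not part of the pandas API):
import json
import pandas as pd

def read_json_py(data, loads_fn=json.loads):
    # Hypothetical helper: parse with a caller-supplied loads function
    # (the stdlib json module keeps arbitrary-precision integers),
    # then hand the parsed object to the DataFrame constructor.
    return pd.DataFrame(loads_fn(data))

print(read_json_py('{"id": [10254939386542155531]}'))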
Comment From: mroeschke
Going to close this issue as I believe it's the same issue as https://github.com/pandas-dev/pandas/issues/20599
Comment From: deponovo
I don't think this issue is solved.
Test data (test.csv):
data1,data2
10254939386542155531,12764512476851234
Test code:
import pandas as pd
import json
df = pd.read_csv(r'test.csv') # all fine
df
Out[45]:
data1 data2
0 10254939386542155531 12764512476851234
df.dtypes
Out[46]:
data1 uint64
data2 int64
dtype: object
pd.read_json(json.dumps(df.to_dict()))  # problem here
Results in error:
File "<env_path>\lib\site-packages\pandas\io\json\_json.py", line 1140, in _parse_no_numpy
loads(json, precise_float=self.precise_float), dtype=None
ValueError: Value is too big
then:
json.loads(json.dumps(df.to_dict()))
Out[48]: {'data1': {'0': 10254939386542155531}, 'data2': {'0': 12764512476851234}}
This behavior is therefore inconsistent. Also, 10254939386542155531 fits within uint64.
(Pandas 1.3.4, json 2.0.9, numpy 1.21.3)
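As a stopgap, a minimal sketch of the same round trip done through the stdlib json module (reusing the df from the test code above) avoids the error:
import json
import pandas as pd

# Parse with the stdlib json module, which keeps arbitrary-precision integers,
# instead of pd.read_json (which uses the bundled ujson-based parser).
restored = pd.DataFrame(json.loads(df.to_json()))
print(restored.dtypes)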
Comment From: jreback
@deponovo this was redirected to another open issue - look there and see if the example is sufficient.
Issues get fixed when members of the community put up pull requests; the core team can provide code review.
Comment From: deponovo
Is there a wiki on how I could do this? It would be my first time.
Comment From: deponovo
take
Comment From: krpatter-intc
I'm using pandas 1.4.1 and can still reproduce this issue:
pandas.read_json(pandas.DataFrame([{"foo":441350462101887021235463396661461057}]).to_json(orient="table"), orient="table")
Sorry, one thing to note: it seems to_json knows this is a problem and marks the field type as string in the schema, but it still writes the value to the JSON as a Python long (instead of a string):
'{"schema":{"fields":[{"name":"index","type":"integer"},{"name":"foo","type":"string"}],"primaryKey":["index"],"pandas_version":"1.4.0"},"data":[{"index":0,"foo":441350462101887021235463396661461057}]}'
Comment From: krpatter-intc
@deponovo Would it be helpful to open a new issue for this?
Is it OK to do the json overrides I see in other related issues? Is it a performance trade-off, and is that why pandas uses a custom JSON parser?
Comment From: dmitriyshashkin
The workaround posted above doesn't seem to work anymore. Does anyone have a new one that works?
Comment From: gOATiful
It doesn't work for me either; this seems to do the trick instead:
import json
import pandas as pd

def load_json_lines_to_df(path: str) -> pd.DataFrame:
    # Parse each line with the stdlib json module, which handles arbitrarily
    # large integers, then build the DataFrame from the parsed records.
    with open(path) as infile:
        ret = [json.loads(s) for s in infile.readlines()]
    return pd.DataFrame(ret)
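For context, a usage sketch (data.jsonl is a hypothetical newline-delimited JSON file with one object per line):
# data.jsonl contents:
# {"id": 10254939386542155531}
# {"id": 12764512476851234}
df = load_json_lines_to_df("data.jsonl")
print(df.dtypes)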