python - 2.2GB JSON file parses inconsistently
I am trying to decode a large UTF-8 JSON file (2.2 GB). I load the file like so:

    import codecs
    import json

    f = codecs.open('output.json', encoding='utf-8')
    data = f.read()
If I try any of json.load, json.loads or json.JSONDecoder().decode, I get this error:
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-40-fc2255017b19> in <module>()
    ----> 1 j = jd.decode(data)

    /usr/lib/python2.7/json/decoder.pyc in decode(self, s, _w)
        367         end = _w(s, end).end()
        368         if end != len(s):
    --> 369             raise ValueError(errmsg("Extra data", s, end, len(s)))
        370         return obj
        371

    ValueError: Extra data: line 1 column -2065998994 - line 1 column 2228968302 (char -2065998994 - 2228968302)
uname -m shows x86_64, and

    > python -c 'import sys;print("%x" % sys.maxsize, sys.maxsize > 2**32)'
    ('7fffffffffffffff', True)

so I should be on 64-bit, and integer size shouldn't be a problem.
However, if I run:

    jd = json.JSONDecoder()
    len(data)  # 2228968302
    j = jd.raw_decode(data)
    j[1]  # 2228968302
The second value in the tuple returned by raw_decode is the end of the string, so raw_decode seems to parse the entire file with seemingly no garbage at the end.
So, is there something I should be doing differently with the JSON? Is raw_decode actually decoding the entire file? Why is json.load(s) failing?
I'd add this as a comment, but the formatting capabilities in comments are too limited.
Staring at the source code,

    raise ValueError(errmsg("Extra data", s, end, len(s)))

calls this function:

    def errmsg(msg, doc, pos, end=None):
        ...
        fmt = '{0}: line {1} column {2} - line {3} column {4} (char {5} - {6})'
        return fmt.format(msg, lineno, colno, endlineno, endcolno, pos, end)
The (char {5} - {6}) part of the format is this part of the error message you showed:

    (char -2065998994 - 2228968302)

So, in errmsg(), pos is -2065998994 and end is 2228968302. Behold! ;-):

    >>> pos = -2065998994
    >>> end = 2228968302
    >>> 2**32 + pos
    2228968302L
    >>> 2**32 + pos == end
    True
That is, pos and end are "really" the same. Back where errmsg() was called, that means end and len(s) are really the same too - but end is being viewed as a 32-bit signed integer. end in turn comes from a regular expression match object's end() method.
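To make the wraparound concrete, here is a minimal sketch that reinterprets the low 32 bits of a number as a signed integer. The helper name as_int32 is my own and is not part of the json or re modules; it only illustrates what presumably happens to the match end somewhere in the C code:

```python
def as_int32(n):
    """Reinterpret the low 32 bits of n as a signed two's-complement value."""
    n &= 0xFFFFFFFF                 # keep only the low 32 bits
    if n >= 0x80000000:             # top bit set means negative in 32-bit signed
        return n - 0x100000000
    return n


length = 2228968302                 # len(data) from the question
print(as_int32(length))             # -2065998994, the bogus "end" in the error
print(as_int32(length) == length)   # False, hence the "Extra data" complaint
```

Any length below 2**31 passes through unchanged, which would explain why smaller files never hit this.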
So the real problem here appears to be a 32-bit limitation/assumption in the regexp engine. I encourage you to open a bug report!
Later: to answer your questions, yes, raw_decode() is decoding the entire file. The other methods call raw_decode(), but add those (failing!) sanity checks afterwards.
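If you need something that works in the meantime, one possible workaround is to call raw_decode() yourself and do the trailing check without the regexp. This is only a sketch: the function name decode_via_raw is hypothetical, it assumes raw_decode() keeps behaving as it did in your session, and I have not tried it on a file of that size.

```python
import json

JSON_WHITESPACE = " \t\n\r"


def decode_via_raw(text):
    """Decode one JSON document, checking for trailing data without a regexp."""
    # Skip leading whitespace by hand; raw_decode() expects to start on a value.
    start = 0
    while start < len(text) and text[start] in JSON_WHITESPACE:
        start += 1

    obj, end = json.JSONDecoder().raw_decode(text, start)

    # Plain string slicing does the "extra data" check here, so no regexp
    # match position is involved. This copies the tail, which is normally tiny.
    if text[end:].strip(JSON_WHITESPACE):
        raise ValueError("Extra data starting at char %d" % end)
    return obj
```

With the data from the question you would call it as j = decode_via_raw(data); it raises ValueError only if something other than whitespace really follows the document.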