python - 2.2GB JSON file parses inconsistently
I am trying to decode a large UTF-8 JSON file (2.2 GB). I load the file like so:

    import codecs
    import json

    f = codecs.open('output.json', encoding='utf-8')
    data = f.read()

If I try any of json.load, json.loads or json.JSONDecoder().decode, I get an error:

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-40-fc2255017b19> in <module>()
    ----> 1 j = jd.decode(data)

    /usr/lib/python2.7/json/decoder.pyc in decode(self, s, _w)
        367         end = _w(s, end).end()
        368         if end != len(s):
    --> 369             raise ValueError(errmsg("Extra data", s, end, len(s)))
        370         return obj
        371

    ValueError: Extra data: line 1 column -2065998994 - line 1 column 2228968302 (char -2065998994 - 2228968302)

uname -m shows x86_64, and

    > python -c 'import sys;print("%x" % sys.maxsize, sys.maxsize > 2**32)'
    ('7fffffffffffffff', True)

so I should be on 64-bit, and integer size shouldn't be a problem.
However, if I run:

    jd = json.JSONDecoder()
    len(data)                  # 2228968302
    j = jd.raw_decode(data)
    j[1]                       # 2228968302

the second value in the tuple returned by raw_decode is the end of the string, so raw_decode seems to parse the entire file, with seemingly no garbage at the end.

So, is there something I should be doing differently with the JSON? Is raw_decode actually decoding the entire file? Why is json.load(s) failing?
I'd add this as a comment, but the formatting capabilities in comments are too limited.

Staring at the source code,

    raise ValueError(errmsg("Extra data", s, end, len(s)))

calls this function:

    def errmsg(msg, doc, pos, end=None):
        ...
        fmt = '{0}: line {1} column {2} - line {3} column {4} (char {5} - {6})'
        return fmt.format(msg, lineno, colno, endlineno, endcolno, pos, end)

The (char {5} - {6}) part of the format is this part of the error message you showed:

    (char -2065998994 - 2228968302)

So, in errmsg(), pos is -2065998994 and end is 2228968302. Behold! ;-):

    >>> pos = -2065998994
    >>> end = 2228968302
    >>> 2**32 + pos
    2228968302L
    >>> 2**32 + pos == end
    True

That is, pos and end are "really" the same. Going back to where errmsg() was called, that means end and len(s) are also the same, but end is being viewed as a 32-bit signed integer. end in turn comes from a regular expression match object's end() method.
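The wraparound can be reproduced directly. This is only a minimal sketch, assuming the offset was squeezed through a 32-bit signed C int somewhere along the way; as_int32 is a hypothetical helper written for illustration, not part of the json or re modules:

```python
def as_int32(n):
    """Reinterpret the low 32 bits of n as a signed 32-bit integer."""
    # Shift into the unsigned range, wrap modulo 2**32, shift back.
    return (n + 2**31) % 2**32 - 2**31


length = 2228968302      # len(data) from the question
print(as_int32(length))  # -2065998994, the "pos" in the error message
print(as_int32(1000))    # 1000: offsets below 2**31 are unaffected
```

Any offset of 2**31 or more comes out negative this way, which would explain why the check only starts failing once the input passes roughly 2 GB.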
So the real problem here appears to be a 32-bit limitation/assumption in the regexp engine. I encourage you to open a bug report!

Later: To answer your questions, yes, raw_decode() is decoding the entire file. The other methods all call raw_decode(), but add those (failing!) sanity checks afterwards.
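Given that, one possible workaround is to call raw_decode() yourself and do the trailing-data check without going through the regexp engine. This is a rough sketch rather than a tested fix for the 2.2 GB case: it assumes raw_decode() itself keeps returning a correct end offset on your build (as it did in the question), and decode_all and JSON_WHITESPACE are hypothetical names introduced here:

```python
import json

# The four whitespace characters JSON allows around a value.
JSON_WHITESPACE = " \t\n\r"


def decode_all(text):
    """Decode one JSON document, rejecting trailing non-whitespace."""
    # Skip leading whitespace by hand; raw_decode() expects the value
    # to start exactly at the index it is given.
    start = 0
    while start < len(text) and text[start] in JSON_WHITESPACE:
        start += 1
    obj, end = json.JSONDecoder().raw_decode(text, start)
    # Same sanity check as decode(), but with plain integer indexing
    # instead of a regexp match object's end().
    while end < len(text) and text[end] in JSON_WHITESPACE:
        end += 1
    if end != len(text):
        raise ValueError("Extra data at char %d" % end)
    return obj


print(decode_all(' {"a": [1, 2]} \n'))
```

The loops only walk over whitespace at the two ends of the string, so the extra cost should be negligible next to the parse itself.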