Python - 2.2GB JSON file parses inconsistently


I am trying to decode a large UTF-8 JSON file (2.2 GB). I load the file like so:

    import codecs
    f = codecs.open('output.json', encoding='utf-8')
    data = f.read()

If I try any of json.load, json.loads or json.JSONDecoder().decode, I get this error:

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-40-fc2255017b19> in <module>()
    ----> 1 j = jd.decode(data)

    /usr/lib/python2.7/json/decoder.pyc in decode(self, s, _w)
        367         end = _w(s, end).end()
        368         if end != len(s):
    --> 369             raise ValueError(errmsg("Extra data", s, end, len(s)))
        370         return obj
        371

    ValueError: Extra data: line 1 column -2065998994 - line 1 column 2228968302 (char -2065998994 - 2228968302)


uname -m shows x86_64, and

    > python -c 'import sys;print("%x" % sys.maxsize, sys.maxsize > 2**32)'
    ('7fffffffffffffff', True)

so I should be on 64 bit, and integer size shouldn't be a problem.

However, if I run:

    jd = json.JSONDecoder()
    len(data)                # 2228968302
    j = jd.raw_decode(data)
    j[1]                     # 2228968302

The second value in the tuple returned by raw_decode is the index where parsing stopped, and here it equals the length of the string. So raw_decode seems to parse the entire file, with seemingly no garbage at the end.

So, is there something I should be doing differently with the JSON? Is raw_decode actually decoding the entire file? Why is json.load(s) failing?

I'd add this as a comment, but the formatting capabilities in comments are too limited.

Staring at the source code,

    raise ValueError(errmsg("Extra data", s, end, len(s)))

calls this function:

    def errmsg(msg, doc, pos, end=None):
        ...
        fmt = '{0}: line {1} column {2} - line {3} column {4} (char {5} - {6})'
        return fmt.format(msg, lineno, colno, endlineno, endcolno, pos, end)

The (char {5} - {6}) part of the format corresponds to this part of the error message you showed:

(char -2065998994 - 2228968302) 

So, in errmsg(), pos is -2065998994 and end is 2228968302. Behold! ;-):

    >>> pos = -2065998994
    >>> end = 2228968302
    >>> 2**32 + pos
    2228968302L
    >>> 2**32 + pos == end
    True

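That is, pos and end are "really" the same. Going back to where errmsg() was called, that means end and len(s) in decode() are really the same too, but end is being viewed as a 32-bit signed integer, so the comparison end != len(s) wrongly comes out true. That end in turn comes from a regular expression match object's end() method.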
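To make the wraparound concrete, here is a small sketch of the arithmetic. The helper name as_int32 is made up for illustration and is not part of the json or re modules; it only shows what happens when a value above 2**31 - 1 is squeezed into a signed 32-bit field.

```python
def as_int32(n):
    """Reinterpret the low 32 bits of n as a signed 32-bit integer.

    Hypothetical helper for illustration only; it mimics the truncation
    that a 32-bit signed field would apply to a large offset.
    """
    return ((n + 2**31) % 2**32) - 2**31


length = 2228968302          # len(data) reported in the question
wrapped = as_int32(length)   # what a signed 32-bit field would hold

print(wrapped)                     # -2065998994, the "pos" in the error message
print(wrapped + 2**32 == length)   # True: same bits, different interpretation
```

Any string longer than 2**31 - 1 = 2147483647 characters would hit the same wraparound, which fits the 2.2 GB input here.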

So the real problem here appears to be a 32-bit limitation/assumption in the regexp engine. I encourage you to open a bug report!

Later: to answer your questions, yes, raw_decode() is decoding the entire file. The other methods all end up calling raw_decode(), but add those (failing!) sanity checks afterwards.
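Given that, one possible workaround is to call raw_decode() yourself and do the trailing-data check without the regular expression. The following is only a sketch: decode_whole is a hypothetical helper name, it assumes the only problem is the whitespace match described above, and it has not been tried on a file of this size.

```python
import json

JSON_WHITESPACE = " \t\n\r"


def decode_whole(text):
    """Decode one JSON document, rejecting trailing non-whitespace.

    Hypothetical helper: it does the same job as JSONDecoder.decode(),
    but skips whitespace by hand instead of using a regex match.
    """
    # Skip leading whitespace by index so the large string is not copied.
    start = 0
    while start < len(text) and text[start] in JSON_WHITESPACE:
        start += 1

    obj, end = json.JSONDecoder().raw_decode(text, start)

    # The tail after the document should be short, so slicing it is cheap.
    if text[end:].strip(JSON_WHITESPACE):
        raise ValueError("Extra data starting at char %d" % end)
    return obj
```

With the data from the question this would be called as decode_whole(data). The index comparison happens in ordinary Python integers, so it should not depend on the match object's end() value.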

