Python 3 and Unicode - How do I print newlines (general problems understanding this)
I have sifted through lots and lots of Python/Unicode explanations but can't seem to make sense of this.
Here is the situation:
I am pulling loads of comments off Reddit (making a bot) and storing them in MongoDB, and I need to be able to print out the comment trees in order to manually check what's going on.
I have had no problems so far putting the comments into the db, but when I try to print to stdout, the cp1252 charset is having trouble with characters it doesn't support.
As I have read, in Python 3 everything internally (strings) is stored as Unicode, and it's just the input and output that must be bytes, so this is fine - I can encode the Unicode to cp1252, and in a couple of situations I see \x** characters, which I don't mind - I am guessing these represent out-of-range characters?
The problem is that I am printing out the comment trees (to stdout) using \n (linefeeds) and tabs so they are easy to look over, but apparently when you encode a Unicode string with newline escape sequences, it escapes them so that they are printed as literals.
For reference, here is my encode statement:
encoded = post.tree_to_string().encode('cp1252','ignore')
Thanks
Edit:
What I want is:
|parent comment
    |child comment 1
        |gchild comment 1
    |child comment 2
|parent comment 2
What I get is:
b"\n|parent comment \n\n |child comment \n\n etc
When printing to the console, Python will automatically encode strings in the console's encoding (cp437 on Windows in this case) and raise an exception for any character the console encoding does not support. For example:
#!python3
#coding: utf8
print('some text\nwith chinese 美国\ncp1252 ÀÁÂÃ\nand cp437 ░▒▓')
Output:
Traceback (most recent call last):
  File "c:\test.py", line 5, in <module>
    print('some text\nwith chinese \u7f8e\u56fd\ncp1252 \xc0\xc1\xc2\xc3\nand cp437 ░▒▓')
  File "c:\python33\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 24-25: character maps to <undefined>
To change the default, you can alter stdout to explicitly specify the encoding and how to handle errors:
#!python3
#coding: utf8
import io,sys
# Re-wrap the underlying byte stream: same encoding, but replace unsupported characters with '?'
sys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding=sys.stdout.encoding,errors='replace')
print('some text\nwith chinese 美国\ncp1252 ÀÁÂÃ\nand cp437 ░▒▓')
Output (to a cp437 console):
some text
with chinese ??
cp1252 ????
and cp437 ░▒▓
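The errors argument accepts handlers other than 'replace'. As a rough sketch (the helper name encode_with and the sample string are illustrative, not part of the answer), this compares a few of the standard handlers when encoding to cp437:

```python
# Illustrative only: compare standard codec error handlers for cp437.
# 'encode_with' and 'sample' are hypothetical names made up for this sketch.
sample = 'chinese 美国, cp1252 ÀÁ'


def encode_with(handler, text=sample, encoding='cp437'):
    """Encode text with the given error handler, then decode it back to str."""
    return text.encode(encoding, errors=handler).decode(encoding)


for handler in ('ignore', 'replace', 'backslashreplace', 'xmlcharrefreplace'):
    # Every result here is plain ASCII, so printing it is safe on any console.
    print(handler, '->', encode_with(handler))
```

'ignore' silently drops the characters, 'replace' leaves a visible '?', and the last two keep enough information to work out which character was there, which may be more useful when checking comment trees by eye.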
You can also do this explicitly without altering stdout, by encoding the string yourself and writing directly to the buffer interface:
sys.stdout.buffer.write('some text\nwith chinese 美国\ncp1252 ÀÁÂÃ\nand cp437 ░▒▓'.encode('cp437',errors='replace'))
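This also bears on the b"\n|parent comment ..." output in the question: print() on a bytes object shows its repr, escape sequences included, rather than the text itself. A small sketch, using a made-up string as a stand-in for whatever post.tree_to_string() returns:

```python
# Hypothetical stand-in for the result of post.tree_to_string().
tree = '\n|parent comment\n\t|child comment 1\n'

# .encode() returns bytes, not str.
encoded = tree.encode('cp1252', 'ignore')

# print() on bytes shows the repr, so newlines and tabs appear as escapes.
print(encoded)

# Decoding gives a str again, which prints with real newlines and tabs.
print(encoded.decode('cp1252'))
```

If the goal is only to drop unsupported characters, decoding the bytes again (or writing them to sys.stdout.buffer as above) keeps the newlines and tabs intact.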
A third alternative is to set the following environment variable before starting Python, which will alter stdout similarly to the TextIOWrapper solution:
PYTHONIOENCODING=cp437:replace
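One way to check the effect of this variable without touching the current shell is to launch a child interpreter with it set. This is only a sketch: the function name run_child_with_ioencoding is made up here, and it assumes sys.executable points at a usable Python 3 interpreter.

```python
import os
import subprocess
import sys


def run_child_with_ioencoding(code, ioencoding='cp437:replace'):
    """Run code in a child Python with PYTHONIOENCODING set; return raw stdout bytes."""
    env = dict(os.environ, PYTHONIOENCODING=ioencoding)
    result = subprocess.run(
        [sys.executable, '-c', code],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout


# The escapes keep the command line ASCII; the child prints two Chinese characters.
raw = run_child_with_ioencoding(r"print('\u7f8e\u56fd')")
print(raw)  # bytes as encoded by the child; unsupported characters become b'?'
```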
Finally, since you mentioned writing to a file, the easiest way to see all the characters you are writing is to use the UTF-8 encoding for the file:
#!python3
#coding: utf8
# UTF-8 can represent every character, so no error handler is needed here
with open('out.txt','w',encoding='utf8') as f:
    f.write('some text\nwith chinese 美国\ncp1252 ÀÁÂÃ\nand cp437 ░▒▓')
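To confirm that nothing is lost on the way to disk, the file can be read back with the same encoding. A minimal sketch, assuming a temporary directory is an acceptable place for the file (the path is chosen only for illustration):

```python
import os
import tempfile

text = 'some text\nwith chinese 美国\ncp1252 ÀÁÂÃ\nand cp437 ░▒▓'

# Hypothetical location: a fresh temporary directory instead of the working directory.
path = os.path.join(tempfile.mkdtemp(), 'out.txt')

with open(path, 'w', encoding='utf8') as f:
    f.write(text)

# Reading with the same encoding returns the original string unchanged.
with open(path, encoding='utf8') as f:
    round_tripped = f.read()

print(round_tripped == text)  # True
```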