Python 3 and Unicode - How do I print newlines (general problems understanding this)

I have sifted through lots and lots of Python/Unicode explanations but can't seem to make sense of this.

Here is the situation:

I am pulling loads of comments off reddit (making a bot) and storing them in MongoDB, and I need to be able to print out the comment trees in order to manually check what's going on.

I have had no problems so far putting the comments into the db, but when I try to print them to stdout, which uses the cp1252 charset, I have trouble with characters it doesn't support.

As I have read, in Python 3 everything internal (strings) is stored as Unicode, and it's the input and output that must be bytes. That is fine: I can encode the Unicode to cp1252, and in a couple of situations I see \x** characters, which I don't mind. I am guessing they represent out-of-range characters?

The problem is that I am printing out the comment trees (to stdout) using \n (linefeeds) and tabs so they are easy to look over, but apparently when I encode a Unicode string, the newline escape sequences are escaped and printed as literals.

For reference, here is my encode statement:

encoded = post.tree_to_string().encode('cp1252','ignore')

Thanks

Edit:

What I want is

    |parent comment
        |child comment 1
            |gchild comment 1
        |child comment 2
    |parent comment 2

What I get is

b"\n|parent comment \n\n |child comment \n\n etc
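A note on the b"..." output: str.encode() returns a bytes object, and print() shows the repr of a bytes object, which is why the newlines appear as literal \n. The sketch below illustrates this and one possible workaround, decoding the bytes back to a str before printing. The tree string is a made-up stand-in for post.tree_to_string(), whose real output is not shown here.

```python
# Hypothetical stand-in for post.tree_to_string(); the real method is not shown.
tree = "|parent comment\n\t|child comment 美国\n"

# encode() returns bytes; 'ignore' silently drops characters cp1252 lacks.
encoded = tree.encode("cp1252", "ignore")
print(type(encoded).__name__)
print(encoded)  # print() shows the repr of bytes, so newlines appear as \n

# Decoding the bytes again gives a str, so print() emits real newlines and tabs.
printable = encoded.decode("cp1252")
print(printable)
```

The round trip only strips the characters cp1252 cannot represent; passing 'replace' instead of 'ignore' would leave a ? in their place.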

When printing to the console, Python automatically encodes strings in the console's encoding (cp437 on Windows) and raises an exception for any character the console encoding does not support. For example:

    #!python3
    #coding: utf8
    print('some text\nwith chinese 美国\ncp1252 ÀÁÂÃ\nand cp437 ░▒▓')

Output:

    Traceback (most recent call last):
      File "c:\test.py", line 5, in <module>
        print('some text\nwith chinese \u7f8e\u56fd\ncp1252 \xc0\xc1\xc2\xc3\nand cp437 ░▒▓')
      File "c:\python33\lib\encodings\cp437.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    UnicodeEncodeError: 'charmap' codec can't encode characters in position 24-25: character maps to <undefined>

To change the default, you can replace stdout with a wrapper that explicitly specifies the encoding and how to handle errors:

    #!python3
    #coding: utf8
    import io,sys
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding=sys.stdout.encoding,errors='replace')
    print('some text\nwith chinese 美国\ncp1252 ÀÁÂÃ\nand cp437 ░▒▓')

Output (to a cp437 console):

    some text
    with chinese ??
    cp1252 ????
    and cp437 ░▒▓
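To see what such a wrapper does without touching the real stdout, here is a small sketch that points an io.TextIOWrapper at an in-memory buffer and returns the bytes it would have sent to the console. The helper name encode_like_console is made up for illustration, and cp437 is assumed as the console encoding.

```python
import io

def encode_like_console(text, encoding="cp437", errors="replace"):
    """Return the bytes a TextIOWrapper with these settings would emit."""
    raw = io.BytesIO()
    # newline="\n" disables platform newline translation, so the result
    # is the same on every OS.
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors, newline="\n")
    stream.write(text)
    stream.flush()
    data = raw.getvalue()
    stream.detach()  # release the buffer without closing it
    return data

sample = 'some text\nwith chinese 美国\ncp1252 ÀÁÂÃ\nand cp437 ░▒▓'
result = encode_like_console(sample)
print(result)  # repr of the bytes; unsupported characters show up as ?
```

The newlines pass through untouched as real \n bytes; only the characters missing from cp437 are replaced.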

You can also encode explicitly without altering stdout, by writing directly to its buffer interface:

    import sys
    sys.stdout.buffer.write('some text\nwith chinese 美国\ncp1252 ÀÁÂÃ\nand cp437 ░▒▓'.encode('cp437',errors='replace'))

A third alternative is to set the following environment variable before starting Python, which alters stdout in a way similar to the TextIOWrapper solution:

    PYTHONIOENCODING=cp437:replace
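One way to check the effect of the variable is to start a child interpreter with it set and capture what the child writes. This is only a sketch: it assumes Python 3.7 or later (for capture_output) and that sys.executable points at a usable interpreter.

```python
import os
import subprocess
import sys

# Copy the current environment and add the variable for the child only.
env = dict(os.environ, PYTHONIOENCODING="cp437:replace")

# The child prints two Chinese characters that cp437 cannot represent.
child_code = r"print('\u7f8e\u56fd')"
completed = subprocess.run(
    [sys.executable, "-c", child_code],
    env=env,
    capture_output=True,
)

# strip() removes the trailing newline, which differs between platforms.
child_output = completed.stdout.strip()
print(child_output)
```

If the variable is picked up, the child does not raise UnicodeEncodeError and each unsupported character arrives as a ? byte.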

Finally, since you mentioned writing to a file, the easiest way to see all the characters you are writing is to use UTF-8 encoding for the file:

    #!python3
    #coding: utf8
    with open('out.txt','w',encoding='utf8') as f:
        f.write('some text\nwith chinese 美国\ncp1252 ÀÁÂÃ\nand cp437 ░▒▓')
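As a quick check that nothing is lost with UTF-8, the sketch below writes the same text to a file and reads it back. It uses a temporary directory rather than out.txt so it leaves nothing behind; that choice is just for illustration.

```python
import os
import tempfile

text = 'some text\nwith chinese 美国\ncp1252 ÀÁÂÃ\nand cp437 ░▒▓'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.txt")
    # Write with an explicit encoding instead of relying on the platform default.
    with open(path, "w", encoding="utf8") as f:
        f.write(text)
    # Read back with the same encoding; every character survives the round trip.
    with open(path, encoding="utf8") as f:
        read_back = f.read()

print(read_back == text)
```

Opening the file in an editor that understands UTF-8 then shows the comment tree with real newlines and all characters intact.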
