encoding - unicode:characters_to_list doesn't seem to work for a UTF-8 list


I am trying to convert a UTF-8 string to a list of Unicode code points using the Erlang library "unicode". My input data is the string "АБВ" (a Russian string whose correct Unicode representation is [1040,1041,1042]), encoded in UTF-8. When I run the following code:

1> unicode:characters_to_list(<<208,144,208,145,208,146>>, utf8).
[1040,1041,1042]

it returns the correct value, but the following:

2> unicode:characters_to_list([208,144,208,145,208,146], utf8).
[208,144,208,145,208,146]

does not. Why does this happen? I read in the specification that the input data can be either a binary or a list of chars, so it seems to me that I am doing everything right.

The signature of the function is unicode:characters_to_list(Data, InEncoding). It expects Data to be either a binary containing a string encoded in InEncoding, or a possibly deep list of characters (code points) and binaries in InEncoding. It returns a list of Unicode characters. Characters in Erlang are integers.

When you call unicode:characters_to_list(<<208,144,208,145,208,146>>, utf8) or unicode:characters_to_list([1040,1041,1042], utf8), it correctly decodes the Unicode string (yes, the second call is a no-op as long as Data is a list of integers). When you call unicode:characters_to_list([208,144,208,145,208,146], utf8), Erlang thinks you are passing a list of 6 characters, not 6 bytes. Each integer in a list is already treated as a Unicode code point, so the output is the same as the input.
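One way to see that the list is read as six code points rather than six bytes is to encode it back to UTF-8 and count the bytes. The following is only a minimal sketch; the module name codepoints_demo and the function run/0 are made up for illustration. Every integer in the list lies between 128 and 255, and such code points each take two bytes in UTF-8, so the result should be 12 bytes long instead of 6.

```erlang
-module(codepoints_demo).
-export([run/0]).

%% Hypothetical demo: shows how a list of integers is interpreted.
run() ->
    Ints = [208,144,208,145,208,146],
    %% In a list, each integer is taken as a code point, so the call
    %% returns the list unchanged.
    Ints = unicode:characters_to_list(Ints, utf8),
    %% Encoding those six code points as UTF-8 needs two bytes each.
    Encoded = unicode:characters_to_binary(Ints),
    12 = byte_size(Encoded),
    %% The same integers packed into a binary are six UTF-8 bytes,
    %% which decode to three code points.
    [1040,1041,1042] = unicode:characters_to_list(list_to_binary(Ints), utf8),
    ok.
```

After compiling with c(codepoints_demo)., calling codepoints_demo:run(). should return ok if all three pattern matches hold.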

There is no byte type in Erlang, yet you assume that unicode:characters_to_list/2 will accept a list of bytes and behave correctly.

To sum up: there are two usual ways to represent a string in Erlang, bitstrings and lists of characters. unicode:characters_to_list(Data, InEncoding) takes a string Data in one of these representations (or a combination of them) in InEncoding encoding and converts it to a list of Unicode code points.

If you have a list such as [208,144,208,145,208,146] from your example, you can convert it to a binary using erlang:list_to_binary/1 and then pass it to unicode:characters_to_list/2, i.e.

1> unicode:characters_to_list(list_to_binary([208,144,208,145,208,146]), utf8).
[1040,1041,1042]
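If the byte list comes from an untrusted source, it may be worth wrapping this conversion so that invalid or truncated input is reported instead of silently returned as a tuple. The sketch below is one possible shape, not part of the unicode module; the module name utf8_demo, the function decode_bytes/1 and the error atoms are hypothetical. It relies on unicode:characters_to_list/2 returning a plain list on success and an {error, ...} or {incomplete, ...} tuple otherwise.

```erlang
-module(utf8_demo).
-export([decode_bytes/1]).

%% Hypothetical helper: decode a flat list of UTF-8 bytes into code points.
%% list_to_binary/1 raises badarg if the list holds integers above 255.
decode_bytes(Bytes) when is_list(Bytes) ->
    case unicode:characters_to_list(list_to_binary(Bytes), utf8) of
        Chars when is_list(Chars) ->
            {ok, Chars};
        {error, Decoded, Rest} ->
            %% Rest starts with a byte sequence that is not valid UTF-8.
            {error, {invalid_utf8, Decoded, Rest}};
        {incomplete, Decoded, Rest} ->
            %% The input ended in the middle of a multi-byte character.
            {error, {truncated_utf8, Decoded, Rest}}
    end.
```

For the list from the question, utf8_demo:decode_bytes([208,144,208,145,208,146]) should give {ok,[1040,1041,1042]}.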

The unicode module supports Unicode and Latin-1. Thus, since the function expects code points of Unicode or Latin-1, characters_to_list does not need to do anything with the list in the case of a flat list of code points. However, the list may be deep (unicode:characters_to_list([[1040],1041,<<1042/utf8>>]).). That is the reason for supporting the list datatype for the Data argument.
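A small sketch of the deep-list case, with a Latin-1 call for contrast. The module name deep_demo and the function run/0 are made up for illustration. With the default encoding, binaries nested in the list are decoded as UTF-8 while bare integers pass through as code points; with latin1, every byte of a binary is one code point.

```erlang
-module(deep_demo).
-export([run/0]).

%% Hypothetical demo of deep and mixed input to characters_to_list.
run() ->
    %% A nested list, a bare code point and a UTF-8 encoded binary.
    Deep = [[1040], 1041, <<1042/utf8>>],
    %% The result is flattened and the binary part is decoded.
    [1040,1041,1042] = unicode:characters_to_list(Deep),
    %% As Latin-1, the bytes 208 and 144 are two separate characters.
    [208,144] = unicode:characters_to_list(<<208,144>>, latin1),
    %% As UTF-8, the same two bytes form the single character 1040.
    [1040] = unicode:characters_to_list(<<208,144>>, utf8),
    ok.
```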

