Encoding: unicode:characters_to_list doesn't seem to work for a UTF-8 list
I am trying to convert a UTF-8 string to a Unicode (code point) list with the Erlang library "unicode". My input data is the string "АБВ" (a Russian string, whose correct Unicode representation is [1040,1041,1042]), encoded in UTF-8. When I run the following code:
1> unicode:characters_to_list(<<208,144,208,145,208,146>>,utf8).
[1040,1041,1042]
it returns the correct value, but the following:
2> unicode:characters_to_list([208,144,208,145,208,146],utf8).
[208,144,208,145,208,146]
does not. Why does this happen? I read in the specification that the input data can be either a binary or a list of chars, so, as far as I can tell, I am doing it right.
The signature of the function is unicode:characters_to_list(Data, InEncoding). It expects Data to be either a binary containing a string encoded in InEncoding, or a possibly deep list of characters (code points) and binaries in InEncoding. It returns a list of Unicode characters. Characters in Erlang are integers.
When you call unicode:characters_to_list(<<208,144,208,145,208,146>>, utf8) or unicode:characters_to_list([1040,1041,1042], utf8), it correctly decodes the Unicode string (yes, the second call is a no-op as long as Data is a list of integers). When you call unicode:characters_to_list([208,144,208,145,208,146], utf8), Erlang thinks you are passing a list of 6 characters in utf8 encoding; since that is already Unicode, the output is the same as the input.
There is no byte type in Erlang, but you assume that unicode:characters_to_list/2 will accept a list of bytes and behave correctly.
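The difference between the three calls above can be checked with a small sketch. The module name codepoints_vs_bytes and the function demo/0 are made up for illustration; only unicode:characters_to_list/2 and list_to_binary/1 come from Erlang itself.

```erlang
-module(codepoints_vs_bytes).
-export([demo/0]).

%% Each match below fails with a badmatch error if the claim is wrong.
demo() ->
    Utf8Bytes = [208,144,208,145,208,146],
    %% A binary is decoded as UTF-8: three 2-byte sequences, three code points.
    [1040,1041,1042] =
        unicode:characters_to_list(list_to_binary(Utf8Bytes), utf8),
    %% Integers in a list are taken as code points already,
    %% so the same numbers in a list come back unchanged.
    Utf8Bytes = unicode:characters_to_list(Utf8Bytes, utf8),
    %% A list that already holds the intended code points is also unchanged.
    [1040,1041,1042] = unicode:characters_to_list([1040,1041,1042], utf8),
    ok.
```

Calling codepoints_vs_bytes:demo() should return ok if all three matches hold.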
To sum up: there are two usual ways to represent a string in Erlang, bitstrings and lists of characters. unicode:characters_to_list(Data, InEncoding) takes a string Data in one of these representations (or a combination of them) in InEncoding and converts it to a list of Unicode code points.
If you have a list such as [208,144,208,145,208,146] from your example, you can convert it to a binary using erlang:list_to_binary/1 and then pass it to unicode:characters_to_list/2, i.e.
1> unicode:characters_to_list(list_to_binary([208,144,208,145,208,146]), utf8).
[1040,1041,1042]
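If the byte list comes from an untrusted source, it may be worth wrapping that conversion, because unicode:characters_to_list/2 can also return an {error, ...} or {incomplete, ...} tuple instead of a list. The following is only a sketch: the module name utf8_bytes, the function to_codepoints/1 and the invalid_utf8 / truncated_utf8 tags are hypothetical, and the input is assumed to be a flat list of integers in 0..255.

```erlang
-module(utf8_bytes).
-export([to_codepoints/1]).

%% to_codepoints(Bytes) -> {ok, [CodePoint]} | {error, Reason}
%% Assumption: Bytes is a flat list of integers in 0..255 holding UTF-8 data.
%% list_to_binary/1 raises badarg for anything else.
to_codepoints(Bytes) when is_list(Bytes) ->
    case unicode:characters_to_list(list_to_binary(Bytes), utf8) of
        CodePoints when is_list(CodePoints) ->
            {ok, CodePoints};
        {error, Decoded, Rest} ->
            %% Invalid UTF-8: Decoded is what was converted before the bad byte.
            {error, {invalid_utf8, Decoded, Rest}};
        {incomplete, Decoded, Rest} ->
            %% The input ended in the middle of a multi-byte sequence.
            {error, {truncated_utf8, Decoded, Rest}}
    end.
```

With the list from the question, utf8_bytes:to_codepoints([208,144,208,145,208,146]) gives {ok,[1040,1041,1042]}.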
The unicode module supports Unicode and Latin-1. Thus, since the function expects code points of Unicode or Latin-1, characters_to_list has nothing to convert in the case of a flat list of code points. However, the list may be deep, as in unicode:characters_to_list([[1040],1041,<<1042/utf8>>]). That is the reason the list datatype is supported for the Data argument.
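A deep list like the one above can be tried out with a short sketch. The module name deep_chardata and the function demo/0 are illustrative only; the calls themselves are the standard unicode:characters_to_list/1 and /2.

```erlang
-module(deep_chardata).
-export([demo/0]).

demo() ->
    %% <<1042/utf8>> builds the UTF-8 binary <<208,146>> for code point 1042.
    <<208,146>> = <<1042/utf8>>,
    %% Nested lists, bare code points and UTF-8 binaries can be mixed.
    Deep = [[1040], 1041, <<1042/utf8>>],
    [1040,1041,1042] = unicode:characters_to_list(Deep),
    %% characters_to_list/1 uses the unicode (UTF-8) input encoding,
    %% so naming utf8 explicitly gives the same result.
    [1040,1041,1042] = unicode:characters_to_list(Deep, utf8),
    ok.
```

Here the list elements pass through as code points, while only the embedded binary is actually decoded from UTF-8.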