hadoop - Hive delimiter \n ^M issue
I have a file with columns delimited by ^A and rows delimited by the '\n' newline character.
I first uploaded it to HDFS and then created a table in Hive using a command like this:
CREATE EXTERNAL TABLE IF NOT EXISTS html_sample (ts STRING, url STRING, html STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\n' LOCATION '/tmp/directoryname/';
However, when I run a select statement on this table, the result turns out to be a mess.
The table looks like this:
ts              url                 html
10082013        http://url.com/01   <doctype>.....style="padding-top: 10px;
text-align...   NULL                NULL
text-align...   NULL                NULL
text-align...   NULL                NULL
10092013        http://url.com/02   <doctype>.....style="padding-top: 10px;
text-align...   NULL                NULL
text-align...   NULL                NULL
text-align...   NULL                NULL
Then I went back to the text file and found that there are several ^M characters in it, which makes Hive treat ^M as a newline character.
When I first created the file, I intentionally removed the newline characters from the HTML to guarantee that each record is one line. However, I cannot understand how on earth Hive treats ^M as a newline character. How can I get around this without modifying the file?
(I know it might be possible to do a global substitution in vi or sed, but it doesn't make sense to me how Hive can treat ^M as \n.)
^M is the way vim displays a carriage return (\r), which is part of a Windows line ending (\r\n). Here's more on this: What does the ^M character mean in Vim?
And Hive in turn uses TextInputFormat, which happens to treat it as a valid line terminator.
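To see why a stray ^M yields extra rows padded with NULL, here is a minimal Python sketch. It does not call Hadoop or Hive; it only mimics a line reader that accepts \r, \n, or \r\n as a line end and compares it with splitting on \n alone. The sample record and the function names are made up for illustration.

```python
import re

def split_like_line_reader(data):
    """Split bytes on \\r\\n, \\r, or \\n (a reader that accepts all three)."""
    parts = re.split(rb"\r\n|\r|\n", data)
    if parts and parts[-1] == b"":
        parts.pop()  # drop the empty piece after the final terminator
    return parts

def split_on_newline_only(data):
    """Split bytes on \\n only, which is what the table definition suggests."""
    parts = data.split(b"\n")
    if parts and parts[-1] == b"":
        parts.pop()
    return parts

# Hypothetical record: three ^A-delimited fields, with one \r inside the HTML.
record = (
    b"10082013\x01http://url.com/01\x01"
    b"<doctype>style=\"padding-top: 10px;\rtext-align: left\"\n"
)

rows_all = [line.split(b"\x01") for line in split_like_line_reader(record)]
rows_nl = [line.split(b"\x01") for line in split_on_newline_only(record)]

print(len(rows_all), [len(r) for r in rows_all])
print(len(rows_nl), [len(r) for r in rows_nl])
```

With the first splitter the record breaks in two, and the second piece has a single field, so the remaining two columns have nothing to hold and show up as NULL in the query output.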
Depending on the versions of Hadoop and Hive you're using, there can be different ways to overcome this (from changing a property in the config to a custom InputFormat implementation).
Just find a way to specify the separator explicitly.
And yeah, LINES TERMINATED BY '\n' is not what it looks like. I'm using Hive 0.11, and the only possible value is '\n', and it is not promoted to TextInputFormat.
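If rewriting the data turns out to be acceptable after all, the global substitution mentioned in the question (vi or sed) can also be sketched in Python. This is a workaround on a copy of the file, not one of the config or InputFormat approaches named above. The function name, the file names, and the choice of a space as the replacement are assumptions for illustration.

```python
import os
import tempfile

def replace_carriage_returns(src_path, dst_path, replacement=b" "):
    """Copy src_path to dst_path, replacing every \\r byte with `replacement`."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Read in 1 MiB chunks; \r is a single byte, so a chunk
        # boundary can never split it.
        for chunk in iter(lambda: src.read(1 << 20), b""):
            dst.write(chunk.replace(b"\r", replacement))

# Small demonstration on a temporary file (hypothetical sample data).
workdir = tempfile.mkdtemp()
raw_path = os.path.join(workdir, "html_sample_raw.txt")
clean_path = os.path.join(workdir, "html_sample_clean.txt")

with open(raw_path, "wb") as f:
    f.write(b"10082013\x01http://url.com/01\x01<doctype>a\rb\rc\n")

replace_carriage_returns(raw_path, clean_path)

with open(clean_path, "rb") as f:
    cleaned = f.read()

print(cleaned.count(b"\r"), cleaned.count(b"\n"))
```

The cleaned copy keeps the ^A field delimiters and the trailing \n untouched, so each record stays on one line for any reader that also splits on \r.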