hadoop - HIVE delimiter \n ^M issue


I have a file whose columns are delimited by ^A and whose rows are delimited by the '\n' newline character.

I first uploaded it to HDFS and then created a table in Hive using a command like this:

create external table if not exists html_sample (
    ts string,
    url string,
    html string
)
row format delimited
fields terminated by '\001'
lines terminated by '\n'
location '/tmp/directoryname/';

However, when I ran a select statement against the table, the result turned out to be a mess.

The table looks like this:

ts              url                    html
10082013        http://url.com/01      <doctype>.....style="padding-top: 10px;
text-align...   null                   null
text-align...   null                   null
text-align...   null                   null
10092013        http://url.com/02      <doctype>.....style="padding-top: 10px;
text-align...   null                   null
text-align...   null                   null
text-align...   null                   null

Then I went back to the text file and found that it contains several ^M characters, which makes Hive treat ^M as a newline character.
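One way to confirm the presence of ^M without opening the file in an editor is to count the carriage-return bytes (0x0D) directly. The following is only a minimal sketch in Python; the function name count_carriage_returns and the chunk size are illustrative choices, not anything from Hive or Hadoop.

```python
def count_carriage_returns(path, chunk_size=1 << 20):
    """Count 0x0D bytes (displayed as ^M in Vim) in a file.

    The file is read in binary mode and in chunks, so large
    files do not have to fit in memory.
    """
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # end of file
                break
            total += chunk.count(b"\r")
    return total
```

A non-zero count on a file that is supposed to have one record per '\n'-terminated line suggests that some records contain embedded carriage returns.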

When I first created the file, I intentionally removed the newline characters from the HTML to guarantee that each record is one line. However, I cannot understand how on earth Hive treats ^M as a newline character. How can I get around this without modifying the file?

(I know it might be possible to do a global substitution in vi or sed, but it doesn't make sense to me how Hive could treat ^M as \n.)

^M is the way Vim displays the carriage return character (\r) found in Windows line endings. Here's more on this: What does the ^M character mean in Vim?

Hive in turn uses TextInputFormat, which happens to treat it as a valid line terminator.
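The effect can be illustrated outside Hadoop. The sketch below is plain Python, not Hadoop code: it assumes the default line reader accepts \r\n, \r, or \n as a terminator and compares that with splitting on '\n' only. The function names and the sample record are made up for illustration.

```python
import re

FIELD_SEP = "\x01"  # ^A, the field delimiter used in the table definition


def split_like_default_reader(data):
    """Split on \\r\\n, \\r, or \\n (assumed default line-reader behaviour)."""
    lines = re.split(r"\r\n|\r|\n", data)
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after the final terminator
    return lines


def split_on_newline_only(data):
    """Split on \\n only, as an explicit record delimiter would."""
    lines = data.split("\n")
    if lines and lines[-1] == "":
        lines.pop()
    return lines


# One logical record whose html field contains an embedded carriage return.
record = FIELD_SEP.join([
    "10082013",
    "http://url.com/01",
    '<doctype>style="padding-top: 10px;\rtext-align: left"',
]) + "\n"

default_rows = [l.split(FIELD_SEP) for l in split_like_default_reader(record)]
explicit_rows = [l.split(FIELD_SEP) for l in split_on_newline_only(record)]

print(len(default_rows))   # 2: the record is broken at the \r
print(len(explicit_rows))  # 1: the record stays intact
```

With the default behaviour, the second piece has only one field, which matches the rows in the question where a fragment of HTML lands in the first column and the other columns come back as null.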

Depending on the versions of Hadoop and Hive you're using, there can be different ways to overcome this (from changing a property in the config to a custom InputFormat implementation).

Just find a way to specify the separator explicitly.

And yeah, lines terminated by '\n' is not what it looks like. I'm using Hive 0.11, where the only possible value is '\n' and it is not propagated to TextInputFormat.

