logging - Aggregating Hadoop job counters across multiple jobs -


Hadoop: version 1.2.1, 1 + 8 node cluster

My use case is that I'm trying to measure the time taken to execute a specific Pig script, and how that time is being spent from a MapReduce point of view. I need to run the Pig script multiple times (say 100) and average the time. I've enabled pig.udf.profile, which gives me the time spent in each UDF as MapReduce counters. I'm also interested in the other latency and memory metrics reported for each job (CPU time, heap usage). I can see these counters in the JobTracker web UI (host:50030/jobdetails.jsp?jobid=blah). Now, my question is: is there a way to aggregate these counters across jobs? Or, how do I build a table that looks like this:

                            run1   run2   run3 ...
    cpu time              |      |      |
    redcr wait            |      |      |
    udfcntr1(approx_us)   |      |      |
    udfcntr2(approx_invc) |      |      |
    countery(approx_us)   |      |      |
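As a rough illustration of the table above, here is a minimal Python sketch. It assumes the counters of each run have already been extracted into a dictionary (one dictionary per run). The function names and the sample values are made up for illustration; they are not part of any Hadoop or Pig API.

```python
from statistics import mean


def build_counter_table(runs):
    """Pivot per-run counter dicts into {counter_name: [run1, run2, ...]}."""
    names = sorted({name for run in runs for name in run})
    # None marks a counter that a given run did not report.
    return {name: [run.get(name) for run in runs] for name in names}


def average_counters(table):
    """Average each counter over the runs that actually reported it."""
    return {
        name: mean(v for v in values if v is not None)
        for name, values in table.items()
    }


def format_table(table):
    """Render the table as text: one row per counter, one column per run."""
    averages = average_counters(table)
    n_runs = max((len(values) for values in table.values()), default=0)
    width = max((len(name) for name in table), default=0)
    header = [" " * width] + [f"run{i + 1}".rjust(8) for i in range(n_runs)]
    lines = [" | ".join(header + ["avg".rjust(10)])]
    for name, values in table.items():
        cells = [("" if v is None else str(v)).rjust(8) for v in values]
        avg_cell = f"{averages[name]:.1f}".rjust(10)
        lines.append(" | ".join([name.ljust(width)] + cells + [avg_cell]))
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical values for three runs; real ones would come from the job counters.
    runs = [
        {"cpu time": 1200, "udfcntr1(approx_us)": 310},
        {"cpu time": 1300, "udfcntr1(approx_us)": 290},
        {"cpu time": 1250},
    ]
    print(format_table(build_counter_table(runs)))
```

Keeping the pivot separate from the formatting means the same table could be written out as CSV instead, if a spreadsheet is more convenient for 100 runs.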

Each run is a different job as far as Hadoop is concerned. After grepping through the log folder, I figured out that the counters are in the history/done/.. folders. Is there an existing technique to combine the results, or am I doomed to write my own parser that goes through each log file? I thought this use case would be common enough to have an existing solution - any pointers would be helpful.

Thanks.

You have a couple of options, and I apologize in advance that none of them is particularly appealing.

Implement a PPNL

The PigProgressNotificationListener is a Java interface that exposes events occurring over the course of a Pig job to arbitrary clients. If you implement this interface and attach an instance of your class, you can grab the Hadoop counters (and many other M/R related metrics) and store them away for later use. Note that this requires a reasonable understanding of Pig internals, though not an expert level of understanding.


Use a system that gathers the metrics

Right now the options are limited to Ambrose, open sourced by Twitter, and Lipstick, open sourced by Netflix. I do not know for sure whether Ambrose collects Hadoop counters, but it could be extended to do so. Lipstick collects Hadoop counters as is. With either of these you can analyze the counters, with a varying level of difficulty depending on how you have configured them to store the data.


Parse the log files

It sounds like you've already thought about going down this route, and there are a couple of reasons to do so:

  1. All the other options require you to know quite a bit about the internals of some project, be it Pig, Lipstick, or Ambrose. If you are planning to dive into one of these for other reasons, then go for it; otherwise it's a big investment for a single use case.
  2. All the other options are limited to Pig jobs. You aren't going to get the data if you run a vanilla Map/Reduce job or start working with other tooling like Hive, Cascading, etc. If you build log-parsing tooling that converts the data into a known format that then gets processed, you should be able to reuse the code (except for the actual parsing) for other Hadoop tools.
  3. If your data is based on logs, and you archive the logs, you can backfill if, 6 months down the road, you're interested in a data point you hadn't collected before. This isn't possible with the other approaches.
  4. It's an established pattern that works. I can't point to a lot of online resources discussing it, but I have met quite a few people who independently took this approach and are happy with it.
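To give an idea of what the log-parsing route involves, here is a minimal Python sketch. It rests on assumptions you should verify against your own files under history/done: that job-level lines start with `Job `, that they carry a quoted `COUNTERS="..."` attribute with one extra layer of backslash escaping, and that the value uses Hadoop 1.x's compact counter format `{(group)(group display name)[(counter)(display name)(value)]...}`. The function names are hypothetical.

```python
import re

# A token is text between parentheses; backslash escapes the next character.
_TOKEN = r"((?:[^()\\]|\\.)*)"
_GROUP_RE = re.compile(
    r"\{\(" + _TOKEN + r"\)\(" + _TOKEN + r"\)((?:\[(?:[^\]\\]|\\.)*\])*)\}"
)
_COUNTER_RE = re.compile(
    r"\[\(" + _TOKEN + r"\)\(" + _TOKEN + r"\)\((\d+)\)\]"
)
# \b keeps this from matching MAP_COUNTERS= or REDUCE_COUNTERS=.
_COUNTERS_ATTR_RE = re.compile(r'\bCOUNTERS="((?:[^"\\]|\\.)*)"')


def _unescape(text):
    """Remove one layer of backslash escaping."""
    return re.sub(r"\\(.)", r"\1", text)


def parse_compact_counters(compact):
    """Return {(group_name, counter_name): value} from a compact counter string."""
    counters = {}
    for group in _GROUP_RE.finditer(compact):
        group_name = _unescape(group.group(1))
        for counter in _COUNTER_RE.finditer(group.group(3)):
            counter_name = _unescape(counter.group(1))
            counters[(group_name, counter_name)] = int(counter.group(3))
    return counters


def counters_from_history_line(line):
    """Extract job-level counters from one history line, or {} if there are none."""
    if not line.startswith("Job "):  # assumption: job-level records start with "Job "
        return {}
    match = _COUNTERS_ATTR_RE.search(line)
    if match is None:
        return {}
    # Undo the attribute-level escaping first, then parse the compact string.
    return parse_compact_counters(_unescape(match.group(1)))


if __name__ == "__main__":
    # Hand-written line for illustration only, not copied from a real history file.
    sample = (
        r'Job JOBID="job_0001" JOB_STATUS="SUCCESS" '
        r'COUNTERS="{(org\.apache\.hadoop\.mapred\.Task$Counter)'
        r'(Map-Reduce Framework)'
        r'[(CPU_MILLISECONDS)(CPU time spent \\(ms\\))(1200)]}" .'
    )
    print(counters_from_history_line(sample))
```

Running `counters_from_history_line` over every line of every history file and keeping one dictionary per job gives the "known format" mentioned in point 2; everything downstream of that (averaging, building tables) no longer depends on how the counters were obtained.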
