Hadoop Streaming


(From the book Hadoop in Action, Section 4.5)


1. Streaming with Unix commands

$ bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar -input input/input.txt -output output -mapper 'cut -f 2 -d ,' -reducer 'uniq'
$ bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar -D mapred.reduce.tasks=0 -input output -output output_a -mapper 'wc -l'

The second command counts the records produced by the first. Its mapper outputs the record count directly, without any reducer, so we set mapred.reduce.tasks to 0 and omit the -reducer option.
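To see what the first command computes, the pipeline can be imitated locally. The sketch below is plain Python, not Hadoop code, and it simplifies the real job: map_cut mimics 'cut -f 2 -d ,', a sorted() call stands in for the shuffle that orders mapper output by key before the reducer sees it, and reduce_uniq mimics 'uniq'. The function names and the sample records are made up for illustration.

```python
from itertools import groupby


def map_cut(lines, field=2, delim=","):
    """Mimic `cut -f 2 -d ,`: emit the chosen 1-based field of each line."""
    for line in lines:
        parts = line.rstrip("\n").split(delim)
        if len(parts) == 1:
            # cut prints lines that contain no delimiter unchanged
            yield parts[0]
        else:
            # a missing field yields an empty line
            yield parts[field - 1] if field <= len(parts) else ""


def reduce_uniq(sorted_lines):
    """Mimic `uniq`: collapse runs of identical adjacent lines."""
    for key, _group in groupby(sorted_lines):
        yield key


def simulate_job(lines):
    """Map, sort (stand-in for the shuffle), then reduce."""
    return list(reduce_uniq(sorted(map_cut(lines))))


# Hypothetical comma-separated records, for illustration only
records = [
    "3858241,956203",
    "3858241,1324234",
    "3858242,1515701",
    "3858243,956203",
]

for value in simulate_job(records):
    print(value)
```

The sort step matters: 'uniq' only removes adjacent duplicates, so it works as a reducer because the framework delivers the mapper output sorted by key.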


2. Streaming with scripts

For example, a Python script can be used in Hadoop to take a smaller sample of a data set. Below is RandomSample.py:

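#!/usr/bin/env python
# Keep each input line with a probability of about argv[1] percent.

import random
import sys

percent = int(sys.argv[1])  # sampling rate, an integer from 1 to 100

for line in sys.stdin:
    if random.randint(1, 100) <= percent:
        print(line.strip())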

Then run it with the following command (under Cygwin on Windows):

$ bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar -D mapred.reduce.tasks=1 -input workspace/data/cite75_99.txt -output workspace/outputa -mapper 'python2.7.exe workspace/RandomSample.py 10' -file workspace/RandomSample.py

Hadoop Streaming supports a -file option to package your executable file as part of the job submission.

Because no reducer is specified, the job uses the default IdentityReducer.
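Before submitting the job, the sampling logic can be checked locally. The sketch below restates the loop from RandomSample.py as a function so it can run on an in-memory list, then passes the result through a sort and a pass-through function standing in for the shuffle and IdentityReducer. The names sample_lines and identity_reduce, the fixed seed, and the synthetic input are assumptions for illustration and are not part of the original script.

```python
import random


def sample_lines(lines, percent, rng):
    """Keep each line with roughly `percent` % probability, as RandomSample.py does."""
    for line in lines:
        if rng.randint(1, 100) <= percent:
            yield line.strip()


def identity_reduce(sorted_lines):
    """Stand-in for IdentityReducer: pass every record through unchanged."""
    for line in sorted_lines:
        yield line


rng = random.Random(42)  # fixed seed only so the run is repeatable
# Synthetic two-column records, for illustration only
data = ["%d,%d\n" % (i, i * 7) for i in range(10000)]

mapped = list(sample_lines(data, 10, rng))
result = list(identity_reduce(sorted(mapped)))  # sorted() stands in for the shuffle

print("kept %d of %d lines" % (len(result), len(data)))
```

With a rate of 10, about one line in ten should survive, and the final output holds the same records as the mapper output, only in sorted order.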

A compiled C++ program can be used as the mapper in the same way (compile the .cpp source to an .exe first):

$ bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar -D mapred.reduce.tasks=1 -input workspace/data/cite75_99.txt -output workspace/outputc -mapper 'workspace/RandomSample.exe 10' -file workspace/RandomSample.exe






