Running and Explaining the Hadoop WordCount Example Program

samuschen

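I have recently been studying cloud computing and looking into the Hadoop framework. It took me a full day to get Hadoop fully running on Linux. I then went through WordCount, the official map-reduce demo program, in detail, and I count that as my introduction to the framework.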

How to run it:

Assume:

/home/cq/wordcount/input - the input path in HDFS
/home/cq/wordcount/output - the output path in HDFS

Use sample text files as input:

$ bin/hadoop fs -ls /home/cq/wordcount/input/
/home/cq/wordcount/input/file01
/home/cq/wordcount/input/file02

$ bin/hadoop fs -cat /home/cq/wordcount/input/file01
Hello World Bye World

$ bin/hadoop dfs -cat /home/cq/wordcount/input/file02
Hello Hadoop Goodbye Hadoop

Run the application:

$ bin/hadoop jar /*/WordCount /home/cq/wordcount/input /home/cq/wordcount/output

The output is:

$ bin/hadoop dfs -cat /home/cq/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
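
To see where these counts come from before reading the Hadoop code, here is a minimal sketch in plain Java that mimics the three stages on the two sample lines: map each line to (word, 1) pairs, group the pairs by word, then sum each group. The class name LocalWordCount and its methods are my own illustration and are not part of Hadoop; a TreeMap stands in for the sort-by-key that the framework performs between map and reduce.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical, Hadoop-free illustration of the WordCount data flow.
class LocalWordCount {

    // Map step: emit one (word, 1) pair per whitespace-separated token.
    static List<Map.Entry<String, Integer>> mapLine(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(Map.entry(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    // Reduce step: sum all the counts collected for a single word.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        // Group step: collect the values of each key; TreeMap keeps keys sorted.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : mapLine(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "Hello World Bye World",          // contents of file01
                "Hello Hadoop Goodbye Hadoop");   // contents of file02
        wordCount(lines).forEach((word, count) ->
                System.out.println(word + "\t" + count));
    }
}
```

Run on the two sample lines, this prints the same five word/count pairs as the part-00000 listing above. Note that the tokenizer is case-sensitive and splits only on whitespace, so "Hello" and "hello" would be counted as different words.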

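Detailed explanation:

WordCount itself is not hard. It just exposes you to many APIs at once, which feel unfamiliar, and compared with traditional development, map-reduce really is a new programming model. To help newcomers avoid detours, I have annotated many of the APIs used in WordCount. Once these methods are understood, the program is simple: it splits each line into words, processes them with map and then with reduce, sets some job information in the main function, and runs the job, at which point the program ends. Without further ado, here is the code:

Java code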

package com.felix;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

/**
 * Description: WordCount, explained by Felix
 * @author Hadoop Dev Group
 */
public class WordCount
{
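    /**
     * MapReduceBase: base class for Mapper and Reducer implementations.
     *   It does not implement Mapper or Reducer itself; it only supplies
     *   no-op defaults for configure() and close().
     * Mapper interface: maps input key/value pairs to intermediate key/value pairs.
     * WritableComparable interface: classes that implement it can be compared
     *   with one another. Every class used as a key should implement it.
     * Reporter: can be used to report the progress of the whole application;
     *   not used in this example.
     */
    public static class Map extends MapReduceBase implements
            Mapper<LongWritable, Text, Text, IntWritable>
    {
        /**
         * LongWritable, IntWritable, and Text are Hadoop classes that wrap Java
         * data types. They implement WritableComparable and can be serialized,
         * which makes it easy to exchange data in a distributed environment.
         * Think of them as substitutes for long, int, and String respectively.
         */
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        /**
         * The map method of the Mapper interface:
         *   void map(K1 key, V1 value, OutputCollector<K2,V2> output, Reporter reporter)
         * Maps a single input k/v pair to intermediate k/v pairs.
         * Output pairs need not have the same types as input pairs, and one
         * input pair may map to zero or more output pairs.
         * OutputCollector interface: collects the <k,v> pairs emitted by the
         *   Mapper and the Reducer.
         * OutputCollector.collect(k, v): adds one (k,v) pair to the output.
         */
        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException
        {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens())
            {
                // Emit (word, 1) for every whitespace-separated token.
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }
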
    /**
     * Reducer interface: reduces the intermediate values that share a key
     * to a smaller set of values. Here the counts for each word are summed.
     */
    public static class Reduce extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable>
    {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException
        {
            int sum = 0;
            while (values.hasNext())
            {
                sum += values.next().get();
            }
            // Emit (word, total count).
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception
    {
        /**
         * JobConf: the map/reduce job configuration class; it describes the
         * map-reduce work to the Hadoop framework.
         * Constructors: JobConf(), JobConf(Class exampleClass), JobConf(Configuration conf), etc.
         */
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");                  // set a user-defined job name
        conf.setOutputKeyClass(Text.class);            // set the key class of the job's output data
        conf.setOutputValueClass(IntWritable.class);   // set the value class of the job's output data
        conf.setMapperClass(Map.class);                // set the Mapper class of the job
        conf.setCombinerClass(Reduce.class);           // set the Combiner class of the job
        conf.setReducerClass(Reduce.class);            // set the Reducer class of the job
        conf.setInputFormat(TextInputFormat.class);    // set the InputFormat implementation of the job
        conf.setOutputFormat(TextOutputFormat.class);  // set the OutputFormat implementation of the job
        /**
         * InputFormat describes the input specification of a map-reduce job.
         * setInputPaths(): sets an array of paths as the input list of the job.
         * setOutputPath(): sets the output directory of the job.
         */
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);                        // run the job
    }
}
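
One line in main deserves a closer look: the job registers the same Reduce class as both combiner and reducer. That works here because summing is associative and commutative, so partial sums computed on the map side can be summed again by the reducer without changing the totals. The sketch below illustrates this in plain Java without Hadoop. CombinerSketch, mapLine, sumByWord, and shuffleInput are hypothetical names of my own, and treating each input line as the output of one map task is a simplifying assumption. A real Hadoop job may apply the combiner zero, one, or several times, so a combiner must never be required for correctness.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical, Hadoop-free illustration of what a combiner saves.
class CombinerSketch {

    // Map step: emit one (word, 1) pair per whitespace-separated token.
    static List<Map.Entry<String, Integer>> mapLine(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(Map.entry(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    // Sums the counts per word. Used as both combiner and reducer,
    // mirroring how the post registers Reduce for both roles.
    static Map<String, Integer> sumByWord(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return sums;
    }

    // Returns the records that would be sent from the map side to the reducer.
    // Assumption: each element of splits is the whole input of one map task.
    static List<Map.Entry<String, Integer>> shuffleInput(List<String> splits,
                                                         boolean useCombiner) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (String split : splits) {
            List<Map.Entry<String, Integer>> mapOutput = mapLine(split);
            if (useCombiner) {
                // Pre-aggregate locally, so each word is sent once per map task.
                shuffled.addAll(sumByWord(mapOutput).entrySet());
            } else {
                shuffled.addAll(mapOutput);
            }
        }
        return shuffled;
    }

    public static void main(String[] args) {
        List<String> splits = List.of(
                "Hello World Bye World",
                "Hello Hadoop Goodbye Hadoop");
        System.out.println("records without combiner: "
                + shuffleInput(splits, false).size());
        System.out.println("records with combiner:    "
                + shuffleInput(splits, true).size());
        System.out.println("final counts: "
                + sumByWord(shuffleInput(splits, true)));
    }
}
```

With the two sample lines, 8 records reach the reducer without the combiner and 6 with it, and the final counts are identical either way. On real inputs with many repeated words the saving in shuffled data is much larger.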

Reposted from http://www.iteye.com/topic/606962

2010-09-15 15:14

Comment #1, darrendu, 2012-04-07

When I run this command, bin/hadoop fs -ls /home/darren/wordcount/input/, why do I keep getting this message:
Cannot access /home/darren/wordcount/input/: No such file or directory.

The details are below, please take a look:

Administrator@china-26d105f11 /cygdrive/e/hadoop/run
$ bin/hadoop fs -ls /home/darren/wordcount/input/
ls: Cannot access /home/darren/wordcount/input/: No such file or directory.

Administrator@china-26d105f11 /cygdrive/e/hadoop/run
$ ls /home/darren/wordcount/input/
file01  file02

Administrator@china-26d105f11 /cygdrive/e/hadoop/run
$ cat /home/darren/wordcount/input/file01
hello world bye world


Administrator@china-26d105f11 /cygdrive/e/hadoop/run
$ vi /home/darren/wordcount/input/file01

Administrator@china-26d105f11 /cygdrive/e/hadoop/run
$ cat /home/darren/wordcount/input/file01
hello world bye world
test

Administrator@china-26d105f11 /cygdrive/e/hadoop/run
$
