Hadoop WordCount example code

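A simple example shows what MapReduce actually is:

Suppose we want to count how many times each word occurs in a large file. Because the file is too big to handle in one piece, we split it into a number of small files and assign several people to count them. That step is the "Map". We then add up the numbers each person produced, and that step is the "Reduce".

To do the same thing in MapReduce, you create a job. The job splits the file into independent data blocks that are distributed across different machine nodes, and Map tasks spread over those nodes process the blocks fully in parallel. MapReduce collects the Map output and passes it on to Reduce for the next stage of processing.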
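The split-count-merge idea above can be sketched in plain Java without any Hadoop classes. This is only an illustration of the concept: the class name `SplitAndMergeSketch` and the sample chunks are made up, and everything runs in a single process rather than on separate nodes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the idea above; it does not use Hadoop.
class SplitAndMergeSketch {

    // "Map" step: one worker counts the words in its own chunk.
    static Map<String, Integer> countChunk(String chunk) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : chunk.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                counts.merge(w, 1, Integer::sum);
            }
        }
        return counts;
    }

    // "Reduce" step: add up the per-chunk counts for each word.
    static Map<String, Integer> mergeCounts(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> p : partials) {
            p.forEach((w, c) -> total.merge(w, c, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical chunks standing in for the pieces of one large file.
        String[] chunks = {"hello hadoop", "hello map reduce", "hadoop reduce"};
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (String chunk : chunks) {
            partials.add(countChunk(chunk));
        }
        System.out.println(mergeCounts(partials));
        // prints {hadoop=2, hello=2, map=1, reduce=2}
    }
}
```

Each call to `countChunk` depends only on its own chunk, which is what lets the real framework run the Map tasks in parallel.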

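As for how a job is actually executed (in the classic Hadoop 1.x runtime), a process called the "JobTracker" coordinates all the tasks of a MapReduce run. A number of TaskTracker processes run the individual Map tasks and keep reporting their progress to the JobTracker. If a TaskTracker reports that a task failed, or has not reported on its task for a long time, the JobTracker assigns that Map task to another TaskTracker to run it again. In Hadoop 2.x with YARN, these roles are taken over by the ResourceManager, a per-job ApplicationMaster, and the NodeManagers.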
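The rescheduling behaviour described above can be pictured with a toy model. The sketch below is not Hadoop's scheduler: the worker names, the `runWithRetry` helper and the failure rule are all hypothetical, and it only shows the idea of handing a failed task to the next worker.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiPredicate;

// Toy model of a coordinator re-running a failed task on another worker.
class RetrySketch {

    // Tries the task on each worker in turn and returns the workers that were tried;
    // the last entry is the worker that succeeded. Throws if every worker fails.
    static List<String> runWithRetry(String task, List<String> workers,
                                     BiPredicate<String, String> attempt) {
        List<String> tried = new ArrayList<>();
        for (String worker : workers) {
            tried.add(worker);
            if (attempt.test(worker, task)) {
                return tried;
            }
        }
        throw new IllegalStateException("task " + task + " failed on all workers");
    }

    public static void main(String[] args) {
        List<String> workers = Arrays.asList("tracker-1", "tracker-2", "tracker-3");
        // Hypothetical failure rule: tracker-1 always fails, the others succeed.
        List<String> tried = runWithRetry("map-0", workers,
                (worker, task) -> !worker.equals("tracker-1"));
        System.out.println(tried); // prints [tracker-1, tracker-2]
    }
}
```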

The concrete code implementation follows:

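1. Write the WordCount job

(1) Create a Maven project in Eclipse with the dependencies below (you can also refer to the pom configuration of the hadoop-mapreduce-examples project in the Hadoop source package).

Note: configure the maven-jar-plugin Maven plugin and specify its mainClass.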

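<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.11</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>2.5.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.5.2</version>
    </dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-jar-plugin</artifactId>
            <configuration>
                <archive>
                    <manifest>
                        <mainClass>com.xxx.demo.hadoop.wordcount.WordCount</mainClass>
                    </manifest>
                </archive>
            </configuration>
        </plugin>
    </plugins>
</build>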

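(2) Given how MapReduce runs, a job needs at least three pieces of code, which handle the Map logic, the Reduce logic, and the job scheduling respectively.

The Map code can extend the org.apache.hadoop.mapreduce.Mapper class: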

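// Imports needed at the top of WordCount.java for this class
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Nested inside the WordCount class
public static class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // The input key is not used in this example, so its type is simply declared as Object
    public void map(Object key, Text value, Context context
            ) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one); // emit (word, 1)
        }
    }
}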
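To see which keys the map step would emit for a given line, the tokenizing loop can be tried on its own in plain Java. The class name `TokenizeSketch` and the sample line are made up for illustration; only `java.util.StringTokenizer` is taken from the mapper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizeSketch {

    // Returns the tokens that the tokenizing loop would produce for one input line.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(itr.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("Hello, hadoop!  hello\thadoop"));
        // prints [Hello,, hadoop!, hello, hadoop]
    }
}
```

With its default delimiters StringTokenizer splits only on whitespace, so "Hello," and "hello" end up as different keys. If that matters for your data, normalizing case and stripping punctuation before `word.set(...)` is one option.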

The Reduce code can extend the org.apache.hadoop.mapreduce.Reducer class:

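// Imports needed at the top of WordCount.java for this class
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Nested inside WordCount; static so that Hadoop can instantiate it by reflection
public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result); // emit (word, total)
    }
}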
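Between map and reduce, the framework groups all values that share a key before calling reduce once per key. The following plain-Java sketch imitates that grouping and the summing loop; `ShuffleSketch` and its sample words are hypothetical, and a real job sorts and groups on serialized keys across machines rather than in a `TreeMap`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ShuffleSketch {

    // Groups (word, 1) pairs by word, roughly as happens between map and reduce.
    static Map<String, List<Integer>> group(List<String> emittedWords) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String w : emittedWords) {
            grouped.computeIfAbsent(w, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // Same summing loop as reduce(), on plain ints.
    static int sum(Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> emitted = Arrays.asList("b", "a", "b", "c", "b");
        // One line per key: the word, a tab, and its total (a 1, b 3, c 1).
        group(emitted).forEach((word, ones) ->
                System.out.println(word + "\t" + sum(ones)));
    }
}
```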

Write the main method that configures and submits the job:

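// Imports needed at the top of WordCount.java for the driver
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // Summing is associative, so the reducer can also serve as the combiner
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // args[0]: input path
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // args[1]: output path, must not exist yet
    job.waitForCompletion(true); // submit the job and block until it finishes
    // Alternative that reports failure through the exit code:
    // System.exit(job.waitForCompletion(true) ? 0 : 1);
}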
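The driver registers IntSumReducer as both combiner and reducer. A quick way to convince yourself this is safe for word counting is to check that summing partial sums gives the same total as summing everything at once. The sketch below does that with plain integers; `CombinerSketch` and the sample values are made up and no Hadoop classes are involved.

```java
import java.util.Arrays;
import java.util.List;

class CombinerSketch {

    static int sum(List<Integer> values) {
        int s = 0;
        for (int v : values) {
            s += v;
        }
        return s;
    }

    public static void main(String[] args) {
        // Values for one word coming from two hypothetical map tasks.
        List<Integer> fromMapA = Arrays.asList(1, 1, 1);
        List<Integer> fromMapB = Arrays.asList(1, 1);

        // Without a combiner the reducer sees all five 1s.
        int direct = sum(Arrays.asList(1, 1, 1, 1, 1));

        // With a combiner each map side pre-sums, and the reducer adds the partial sums.
        int combined = sum(Arrays.asList(sum(fromMapA), sum(fromMapB)));

        System.out.println(direct + " " + combined); // prints 5 5
    }
}
```

An operation such as an average would not pass this check, which is why a reducer cannot always be reused as a combiner.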

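2. Upload the data file to the Hadoop cluster

Run mvn install to package the project as a jar file and upload it to the Linux cluster environment. Use hdfs dfs -mkdir to create the corresponding directory in HDFS, then use hdfs dfs -put to upload the data file to be processed into HDFS, for example: hdfs dfs -put ${linux_path}/<data file> ${hdfs_path}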

3. Run the job

On the cluster, run: hadoop jar ${linux_path}/wordcount.jar ${hdfs_input_path} ${hdfs_output_path}
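The two paths on that command line arrive in main as args[0] and args[1]. The main method shown earlier reads them without checking, so a missing argument fails with an ArrayIndexOutOfBoundsException. A small guard is one way to give a clearer message; the `ArgsSketch` class, the usage text and the sample paths below are hypothetical and not part of the original job.

```java
class ArgsSketch {

    // Returns {inputPath, outputPath}, or throws with a usage message.
    static String[] parse(String[] args) {
        if (args.length != 2) {
            throw new IllegalArgumentException("Usage: wordcount <in> <out>");
        }
        return new String[] {args[0], args[1]};
    }

    public static void main(String[] args) {
        // Hypothetical HDFS paths standing in for the real command-line arguments.
        String[] paths = parse(new String[] {"/user/demo/input", "/user/demo/output"});
        System.out.println("input=" + paths[0] + ", output=" + paths[1]);
    }
}
```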

4. View the results

hdfs dfs -cat ${hdfs_output_path}/<output file name>
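With the default text output format, each output line holds a key and a value separated by a tab, and reducer output files are typically named like part-r-00000. Assuming that layout, the following plain-Java sketch turns such text into a map; `OutputParseSketch` and the sample data are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class OutputParseSketch {

    // Parses lines of the form "word<TAB>count"; lines without a tab are skipped.
    static Map<String, Integer> parse(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : text.split("\n")) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue;
            }
            counts.put(line.substring(0, tab),
                       Integer.parseInt(line.substring(tab + 1).trim()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical output, as it might be printed by hdfs dfs -cat.
        String sample = "hadoop\t2\nhello\t3\n";
        System.out.println(parse(sample)); // prints {hadoop=2, hello=3}
    }
}
```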

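When the Hadoop cluster has not been started, the procedure above runs in Local mode, in which neither HDFS nor YARN takes part. The rest of this article covers what is needed to run a MapReduce job in pseudo-distributed mode, starting with the steps listed on the official site: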

Configure the host name

# vi /etc/sysconfig/network

For example:

NETWORKING=yes
HOSTNAME=master

# vi /etc/hosts

Add the following:

127.0.0.1 localhost

Configure passwordless SSH

# ssh-keygen -t rsa

# cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

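Configure core-site.xml (located in ${HADOOP_HOME}/etc/hadoop/)

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>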
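Hadoop configuration files are lists of name/value properties, and the fs.defaultFS value is a URI whose scheme, host and port identify the NameNode. The sketch below reads such a property with the JDK's own XML parser and splits the URI. It is an illustration only: `CoreSiteSketch` is a made-up helper, and Hadoop itself loads these files through its Configuration class rather than through code like this.

```java
import java.io.ByteArrayInputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class CoreSiteSketch {

    // Returns the <value> of the property with the given <name>, or null if absent.
    static String property(String xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element p = (Element) props.item(i);
                String n = p.getElementsByTagName("name").item(0).getTextContent().trim();
                if (n.equals(name)) {
                    return p.getElementsByTagName("value").item(0).getTextContent().trim();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse configuration", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<configuration><property>"
                + "<name>fs.defaultFS</name><value>hdfs://localhost:9000</value>"
                + "</property></configuration>";
        URI fs = URI.create(property(xml, "fs.defaultFS"));
        System.out.println(fs.getScheme() + " " + fs.getHost() + " " + fs.getPort());
        // prints hdfs localhost 9000
    }
}
```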

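Configure hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <!-- value not shown here; the official single-node setup guide uses 1 -->
    </property>
</configuration>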

The following commands run a MapReduce job in single-node pseudo-distributed mode:

1.Format the filesystem:

$ bin/hdfs namenode -format

2.Start NameNode daemon and DataNode daemon:

$ sbin/start-dfs.sh

3.The hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).

4.Browse the web interface for the NameNode; by default it is available at:

NameNode - http://localhost:50070/

Make the HDFS directories required to execute MapReduce jobs:

$ bin/hdfs dfs -mkdir /user

$ bin/hdfs dfs -mkdir /user/<username>

5.Copy the input files into the distributed filesystem:

$ bin/hdfs dfs -put etc/hadoop input

6.Run some of the examples provided:

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar grep input output 'dfs[a-z.]+'

7.Examine the output files:

Copy the output files from the distributed filesystem to the local filesystem and examine them:

$ bin/hdfs dfs -get output output

$ cat output/*

View the output files on the distributed filesystem:

$ bin/hdfs dfs -cat output/*

8.When you're done, stop the daemons with:

$ sbin/stop-dfs.sh
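The grep example run in the steps above counts the strings that match the regular expression 'dfs[a-z.]+' in the input files. As a rough local illustration of what that pattern picks up, the sketch below applies the same regular expression with java.util.regex; `GrepSketch` and the sample text are made up, and the real example runs as MapReduce jobs over HDFS rather than in one process.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GrepSketch {

    // Counts how often each distinct match of the regex occurs in the text.
    static Map<String, Integer> countMatches(String regex, String text) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            counts.merge(m.group(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical input resembling lines from the etc/hadoop directory.
        String text = "<name>dfs.replication</name>\n"
                + "<name>fs.defaultFS</name>\n"
                + "dfs.replication again\n"
                + "dfsadmin tool\n";
        System.out.println(countMatches("dfs[a-z.]+", text));
        // prints {dfs.replication=2, dfsadmin=1}
    }
}
```

Note that fs.defaultFS is not matched, because the pattern requires the literal prefix "dfs".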

Summary

That covers the Hadoop WordCount example code discussed in this article; hopefully it is useful. If you find anything lacking, feel free to point it out in a comment.
