
A simple Hadoop application: counting the words in a text file

August 17, 2013

=============hadoop-0.12.2-core  version===========================

MyMap.java 

The map method emits every word of the text file into the intermediate output as <key, value> pairs:

Hadoop 1
Bye 1
Hadoop 1
World 1

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MyMap extends MapReduceBase implements Mapper {
	private final static IntWritable one = new IntWritable(1);
	private Text word = new Text();

	@Override
	public void map(WritableComparable key, Writable value,
			OutputCollector output, Reporter reporter) throws IOException {
		String line = value.toString();
		StringTokenizer stz = new StringTokenizer(line);
		while (stz.hasMoreTokens()) {
			word.set(stz.nextToken());
			output.collect(word, one); // emit <word, 1>
		}
	}
}
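The heart of the map step — splitting one line into <word, 1> pairs — can be exercised as plain Java, with no Hadoop runtime. The `MapSketch` class and `mapLine` method below are illustrative names, not part of the Hadoop API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class MapSketch {
    // Mirrors the map body: one "word\t1" pair per whitespace-separated token.
    static List<String> mapLine(String line) {
        List<String> pairs = new ArrayList<>();
        StringTokenizer stz = new StringTokenizer(line);
        while (stz.hasMoreTokens()) {
            pairs.add(stz.nextToken() + "\t1");
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String pair : mapLine("Hadoop Bye Hadoop World")) {
            System.out.println(pair);
        }
    }
}
```

Note that the mapper never aggregates: "Hadoop" is emitted once per occurrence, and grouping is left to the framework's shuffle phase.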

MyReduce.java 

The reduce method iterates over values, which yields every value that shares the same key:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MyReduce extends MapReduceBase implements Reducer {

	public void reduce(WritableComparable key, Iterator values,
			OutputCollector output, Reporter reporter) throws IOException {
		int sum = 0;
		while (values.hasNext()) {
			// The values arrive as IntWritable; read them directly rather
			// than round-tripping through String and Integer.parseInt.
			sum += ((IntWritable) values.next()).get();
		}
		output.collect(key, new IntWritable(sum));
	}
}
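The reduce body is just a running sum over one key's values. A plain-Java sketch (class and method names are illustrative) shows what happens when the framework hands reduce the key "Hadoop" with its grouped counts:

```java
import java.util.Arrays;
import java.util.Iterator;

public class ReduceSketch {
    // Mirrors the reduce body: sum every count seen for a single key.
    static int sum(Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    public static void main(String[] args) {
        // After the shuffle, "Hadoop" arrives with values [1, 1, 1, 1].
        System.out.println("Hadoop\t" + sum(Arrays.asList(1, 1, 1, 1).iterator()));
    }
}
```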

The job driver:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class JobTest {

	public int run(String... args) throws IOException {
		JobConf conf = new JobConf(new Configuration());
		conf.setJobName("wordCount");
		conf.setInputPath(new Path(args[0]));
		conf.setOutputPath(new Path(args[1]));
		conf.setMapperClass(MyMap.class);
		conf.setReducerClass(MyReduce.class);
		conf.setOutputKeyClass(Text.class);
		conf.setOutputValueClass(IntWritable.class);
		JobClient.runJob(conf);
		return 0;
	}

	public static void main(String[] args) {
		try {
			new JobTest().run("D:\\files\\wordCount.txt", "D:\\files\\wordCoutOut");
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
}

Opening the file D:\files\wordCoutOut\part-00000 shows the following result:

Bye 3
Hadoop 4
Hello 3
World 2
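The whole map, shuffle, and reduce pipeline can be simulated in memory, which also shows why part-00000 comes out sorted by key. The input lines below are hypothetical (the post does not show wordCount.txt itself), chosen so the totals match the output above:

```java
import java.util.StringTokenizer;
import java.util.TreeMap;

public class WordCountSketch {
    // Map + shuffle + reduce collapsed into one in-memory pass.
    static TreeMap<String, Integer> count(String[] lines) {
        // TreeMap keeps keys sorted, as the framework's shuffle/sort phase does.
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer stz = new StringTokenizer(line);
            while (stz.hasMoreTokens()) {
                counts.merge(stz.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical file contents, not taken from the post.
        String[] lines = {"Hello World Bye World",
                          "Hello Hadoop Bye Hadoop",
                          "Bye Hadoop Hello Hadoop"};
        count(lines).forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```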

===========hadoop-0.20.2-core version========================

MyMap.java

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMap extends Mapper<Object, Text, Text, IntWritable> {
	private final static IntWritable one = new IntWritable(1);
	private Text word = new Text();

	@Override
	public void map(Object key, Text value, Context context)
			throws IOException, InterruptedException {
		// output and reporter are both folded into Context in the new API
		StringTokenizer itr = new StringTokenizer(value.toString());
		while (itr.hasMoreTokens()) {
			word.set(itr.nextToken());
			context.write(word, one);
		}
	}
}

MyReduce.java

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
	private IntWritable result = new IntWritable();

	@Override
	protected void reduce(Text key, Iterable<IntWritable> values, Context context)
			throws IOException, InterruptedException {
		int sum = 0;
		for (IntWritable val : values) {
			sum += val.get();
		}
		result.set(sum);
		context.write(key, result);
	}
}

JobTest.java

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobTest {

	public int run(String... args)
			throws IOException, InterruptedException, ClassNotFoundException {
		Job job = new Job(new Configuration(), "word count");
		job.setJarByClass(JobTest.class);
		job.setMapperClass(MyMap.class);
		job.setCombinerClass(MyReduce.class); // reduce is a plain sum, so it can double as a combiner
		job.setReducerClass(MyReduce.class);
		job.setOutputKeyClass(Text.class); // key type of the reduce output
		job.setOutputValueClass(IntWritable.class); // value type of the reduce output
		FileInputFormat.addInputPath(job, new Path(args[0])); // input path
		FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path
		// Return the job's status instead of calling System.exit() here;
		// exiting inside run() would make the method's return value useless.
		return job.waitForCompletion(true) ? 0 : 1;
	}

	public static void main(String[] args)
			throws InterruptedException, ClassNotFoundException {
		try {
			System.exit(new JobTest().run("D:\\files\\wordCount.txt", "D:\\files\\wordCoutOut"));
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
}
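Reusing MyReduce as the combiner, as the driver above does, is safe only because word counting is a plain addition, which is associative and commutative: each mapper can pre-aggregate its own output locally, and the reducer merges the partial sums into the same final total. A small sketch (illustrative names, no Hadoop involved):

```java
public class CombinerSketch {
    // The same summing step used by both the combiner and the reducer.
    static int sum(int... parts) {
        int s = 0;
        for (int p : parts) {
            s += p;
        }
        return s;
    }

    public static void main(String[] args) {
        // Two mappers each see "Hadoop" twice and pre-aggregate locally (combiner)...
        int fromMapper1 = sum(1, 1); // combiner output: Hadoop 2
        int fromMapper2 = sum(1, 1); // combiner output: Hadoop 2
        // ...and the reducer merging partial sums equals reducing the raw ones.
        System.out.println("combined:\t" + sum(fromMapper1, fromMapper2));
        System.out.println("uncombined:\t" + sum(1, 1, 1, 1));
    }
}
```

A reducer that is not associative and commutative in this way (for example, one computing an average of raw values) cannot be reused as a combiner without changing the result.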
