Chapter 5: Advanced MapReduce Programming
Ver1.2-20230523

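This chapter covers advanced MapReduce programming: MapReduce input and output formats, the Hadoop Java API, custom key/value types, the Combiner component, the Partitioner component, custom counters, and submitting MapReduce jobs from Eclipse. Custom key/value types, the Combiner, and the Partitioner matter most for optimization, because they can improve a job's running efficiency to some degree.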

http://i.hddly.cn/media/bdeUXxKQpl.mp4

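Task 5.1 Filter the log file and generate a sequence file

Get the test data

Note: use crt (SecureCRT) to log in to the cluster's master or a slave node. user_login.txt is the data set to analyze; if it already exists on HDFS, there is no need to upload it again.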

cd /root/hadoop
wget https://hddly.oss-cn-hangzhou.aliyuncs.com/down/file/user_login.tar.gz
tar -zxvf ./user_login.tar.gz
hdfs dfs -put /root/hadoop/user_login.txt /user/myname/user_login.txt

Create the class: SelectData

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

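Add the Mapper class

public static class SelectDataMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, Text>.Context context)
            throws IOException, InterruptedException {
        // Each input line is expected to hold at least two comma-separated fields
        String[] val = value.toString().split(",");
        // Skip malformed lines that have no second field
        if (val.length < 2) {
            return;
        }
        // Keep only the records from January and February 2016
        if (val[1].contains("2016-01") || val[1].contains("2016-02")) {
            context.write(new Text(val[0]), new Text(val[1]));
        }
    }
}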
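The filter rule used by the Mapper can be tried without a cluster. The following is a minimal plain-Java sketch, assuming each line has the layout "name,time"; the class name SelectDataFilterSketch and the sample lines are hypothetical and are not part of the course project.

```java
// Plain-Java stand-in for the map() filter; no Hadoop classes are involved.
class SelectDataFilterSketch {
    // Returns "name<TAB>time" for records from 2016-01 or 2016-02, otherwise null.
    static String select(String line) {
        String[] val = line.split(",");
        if (val.length < 2) {
            return null; // malformed record: skip it instead of throwing
        }
        if (val[1].contains("2016-01") || val[1].contains("2016-02")) {
            return val[0] + "\t" + val[1];
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines in the assumed "name,time" layout.
        String[] lines = {
            "Aaron,2016-01-04 10:15:00",
            "Betty,2016-03-01 08:00:00",
            "broken-line"
        };
        for (String line : lines) {
            String out = select(line);
            if (out != null) {
                System.out.println(out);
            }
        }
    }
}
```

Only the first sample line passes the filter; the March record and the malformed line are dropped.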

Add the Driver code

public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://master:9864");
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length < 2) {
        // Default input and output paths; change myname to the pinyin of your own name
        otherArgs = new String[] { "/user/myname/user_login.txt", "/user/myname/output_SelectData" };
    }
    Job job = Job.getInstance(conf, "selectdata");
    job.setJarByClass(SelectData.class);
    job.setMapperClass(SelectDataMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setInputFormatClass(TextInputFormat.class); // set the input format
    job.setOutputFormatClass(SequenceFileOutputFormat.class); // set the output format
    job.setNumReduceTasks(0); // map-only job: set the number of Reducer tasks to 0
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    // Delete any previous output directory so that the job can be rerun
    FileSystem.get(conf).delete(new Path(otherArgs[1]), true);
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    // Exit with 0 on success and 1 on failure
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

Run the SelectData class

Right-click the class source -> Run As -> Java Application

Check the run output

Right-click the MemberCount project -> Export -> Runnable JAR file ->

Launch configuration: select the SelectData configuration you just ran

Export destination: d:\soft\hadoop\SelectData.jar

Library handling: select "Extract required libraries into generated JAR"

If any prompt appears during the export, just click OK

Upload the JAR to master

From crt, open an sftp session to master

lcd d:\soft\hadoop\

cd /root/hadoop

put SelectData.jar

hadoop jar SelectData.jar /user/myname/user_login.txt /user/myname/output_SelectData

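Task 5.2 Read the sequence log file with the Hadoop Java API

Get the test data

Note: use crt (SecureCRT) to log in to the cluster's master or a slave node. user_login.txt is the data set to analyze; if it already exists on HDFS, there is no need to upload it again.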

cd /root/hadoop
wget https://hddly.oss-cn-hangzhou.aliyuncs.com/down/file/user_login.tar.gz
tar -zxvf ./user_login.tar.gz
hdfs dfs -put /root/hadoop/user_login.txt /user/myname/user_login.txt

Create a tmp directory on the local D: drive to hold uploaded and downloaded files: D:\tmp

Use the SelectData class from the previous task to generate the target files: /user/myname/output_SelectData

Obtaining a FileSystem instance

Official API documentation

https://hadoop.apache.org/docs/stable/api/index.html

Getting a FileSystem

get(Configuration conf)
Returns the configured FileSystem implementation.

get(URI uri, Configuration conf)
Get a FileSystem for this URI's scheme and authority.

get(URI uri, Configuration conf, String user)
Get a FileSystem instance based on the uri, the passed in configuration and the user.
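The second and third overloads choose the file system from the URI's scheme and authority. The sketch below uses only java.net.URI to show which parts of an address those are; it does not call Hadoop, and the class name FsUriSketch and the sample address are illustrative assumptions.

```java
import java.net.URI;

// Shows the scheme and authority parts of an HDFS-style URI using only the JDK.
class FsUriSketch {
    // Returns "scheme | authority | path" for the given URI text.
    static String describe(String uriText) {
        URI uri = URI.create(uriText);
        return uri.getScheme() + " | " + uri.getAuthority() + " | " + uri.getPath();
    }

    public static void main(String[] args) {
        // Sample address only; use your own NameNode host and port.
        System.out.println(describe("hdfs://master:9864/user/myname"));
    }
}
```

For the sample address the scheme is hdfs and the authority is master:9864; the path plays no part in selecting the file system.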

FileSystem API examples

Running remotely against home.hddly.cn currently has problems: examples that read or write file contents fail, because the cluster's internal IP addresses cannot be reached remotely. Each example below is marked accordingly.

S1_ListDir: list a directory (works remotely against home.hddly.cn)

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListDir {
    public static void main(String[] args) throws IOException {
        // Load the configuration
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://home.hddly.cn:9864/");
        // Get the file system
        FileSystem fs = FileSystem.get(conf);
        // Directory to list
        Path path = new Path("/user/limm/");
        // Get the directory entries
        FileStatus[] fileStatuses = fs.listStatus(path);
        // Walk the entries and report whether each one is a directory or a file
        for (FileStatus file : fileStatuses) {
            if (file.isDirectory()) {
                System.out.println("Dir:" + file.getPath().toString());
            } else if (file.isFile()) {
                System.out.println("File:" + file.getPath().toString());
            } else {
                System.out.println("Other:" + file.getPath().toString());
            }
        }
        // Close the file system
        fs.close();
    }
}

S2_CreateDir: create a directory (works remotely against home.hddly.cn)

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateDir {
    public static void main(String[] args) throws IOException {
        // Load the configuration
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://home.hddly.cn:9864/");
        // Get the file system
        FileSystem fs = FileSystem.get(conf);
        // Directory to create; change myname to your own name
        Path path = new Path("/user/myname/temp");
        // Call mkdirs to create the directory
        fs.mkdirs(path);
        // Close the file system
        fs.close();
    }
}

S3_CopyToLocal: download a file (currently fails remotely)

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToLocal {
    public static void main(String[] args) throws IOException {
        // Load the configuration
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://master:9864/");
        // Get the file system
        FileSystem fs = FileSystem.get(conf);
        // Source path on HDFS and destination path on the local disk
        Path fromPath = new Path("/user/myname/user_login.txt");
        Path toPath = new Path("D:/tmp");
        // Call copyToLocalFile to download the file
        // (keep the source; use the raw local file system)
        fs.copyToLocalFile(false, fromPath, toPath, true);
        // Close the file system
        fs.close();
    }
}

S4_CopyFromLocal: upload a file (currently fails remotely)

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyFromLocal {
    public static void main(String[] args) throws IOException {
        // Load the configuration
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://home.hddly.cn:9864/");
        // Get the file system
        FileSystem fileSystem = FileSystem.get(conf);
        // Alternative: connect as a specific user
        //FileSystem fileSystem = FileSystem.get(URI.create("hdfs://home.hddly.cn:9864/"),conf,"myname");
        // Source path on the local disk and destination path on HDFS
        Path fromPath = new Path("D:/tmp/user_login.txt");
        Path toPath = new Path("/user/myname/temp/user_log.txt");
        // Call copyFromLocalFile to upload the file
        fileSystem.copyFromLocalFile(fromPath, toPath);
        // Close the file system
        fileSystem.close();
    }
}

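S5_CatFile: read and write a file (currently fails remotely)

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CatFile {
    public static void main(String[] args) throws IOException {
        // Load the configuration
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://home.hddly.cn:9864/");
        // Get the file system
        FileSystem fs = FileSystem.get(conf);
        // File to read
        Path path = new Path("/user/myname/temp/user_log.txt");
        // Create the new file, deleting any previous copy first
        Path newPath = new Path("/user/myname/temp/new_user_log.txt");
        fs.delete(newPath, true);
        FSDataOutputStream os = fs.create(newPath);
        // Open a byte stream on the source file
        FSDataInputStream is = fs.open(path);
        // Read the source line by line and write each line to the new file
        BufferedReader br = new BufferedReader(new InputStreamReader(is, "utf-8"));
        BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(os, "utf-8"));
        String line = "";
        while ((line = br.readLine()) != null) {
            bw.write(line);
            bw.newLine();
        }
        // Close the streams
        bw.close();
        os.close();
        br.close();
        is.close();
        // Close the file system
        fs.close();
    }
}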

S6_ListFile: list files (works remotely against home.hddly.cn)

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListFile {
    public static void main(String[] args) throws IOException {
        // Load the configuration
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://home.hddly.cn:9864/");
        // Get the file system
        FileSystem fs = FileSystem.get(conf);
        // Directory to list
        Path path = new Path("/user/myname");
        // Get the directory entries
        FileStatus[] fileStatuses = fs.listStatus(path);
        // Walk the entries
        for (FileStatus file : fileStatuses) {
            // Print only regular files; directories are skipped
            if (file.isFile()) {
                System.out.println(file.getPath().toString());
            }
        }
        // Close the file system
        fs.close();
    }
}

S7_DelFile: delete a file (works remotely against home.hddly.cn)

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DelFile {
    public static void main(String[] args) throws IOException {
        // Load the configuration
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://home.hddly.cn:9864");
        // Get the file system
        FileSystem fs = FileSystem.get(conf);
        // File to delete
        Path path = new Path("/user/myname/temp/user_log.txt");
        // Delete the file
        fs.delete(path, true);
        // Close the file system
        fs.close();
    }
}

S8_DelPath: delete a directory (works remotely against home.hddly.cn)

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DelPath {
    public static void main(String[] args) throws IOException {
        // Load the configuration
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://home.hddly.cn:9864");
        // Get the file system
        FileSystem fs = FileSystem.get(conf);
        // Directory to delete
        Path path = new Path("/user/myname/temp/");
        // Delete the directory and everything in it (recursive = true)
        fs.delete(path, true);
        // Close the file system
        fs.close();
    }
}

Reading the sequence file

DownloadFile: read the sequence file (currently fails remotely)

Note: change the host name and the member directory name to your own, and create a tmp directory on the local D: drive.

package chap5_selectdata;

import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DownloadFile {
    public static void main(String[] args) throws IOException {
        // Load the configuration
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://c31:9864/");
        // Get the file system
        FileSystem fs = FileSystem.get(conf);
        // Create the SequenceFile.Reader
        SequenceFile.Reader reader = new SequenceFile.Reader(fs,
                new Path("/user/limm/output_SelectData/part-m-00000"), conf);
        // Key and value objects of the types stored in the sequence file
        Text key = new Text();
        Text value = new Text();
        // Local output file; opened in append mode, so rerunning adds to the existing content
        BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream("d:\\tmp\\selectdata.txt", true)));
        // Copy every key/value pair to the local file as "key<TAB>value"
        while (reader.next(key, value)) {
            out.write(key.toString() + "\t" + value.toString() + "\r\n");
        }
        out.close();
        reader.close();
        System.out.println("end");
    }
}

Run the main method of the DownloadFile class

Right-click the class source -> Run As -> Java Application

Check the run output

Right-click the MemberCount project -> Export -> Runnable JAR file ->

Launch configuration: select the DownloadFile configuration you just ran

Export destination: d:\soft\hadoop\downfile.jar

Library handling: select "Extract required libraries into generated JAR"

If any prompt appears during the export, just click OK

Run the JAR on Windows

In cmd, change to the d:\tmp directory: cd /d d:\tmp

java -jar d:\soft\hadoop\downfile.jar

Open d:\tmp\selectdata.txt; the content should contain no garbled characters

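Task 5.3 Optimize the log-file statistics program

Key factors in MR optimization

MR implementation: log statistics by month

Goal: use MapReduce to count each user's logins per day in January and February 2016

This builds on the output of Task 5.1: hadoop jar SelectData.jar /user/myname/user_login.txt /user/myname/output_SelectData
Confirm that the following file on HDFS has content:
/user/myname/output_SelectData/part-m-00000

In the MemberCount project, create the package logcount; all the classes below are created in that package

Custom type MemberLogTime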

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class MemberLogTime implements WritableComparable<MemberLogTime> {
    private String member_name;
    private String logTime;

    // Hadoop needs a no-argument constructor to create instances during deserialization
    public MemberLogTime() {
    }

    public MemberLogTime(String member_name, String logTime) {
        this.member_name = member_name;
        this.logTime = logTime;
    }

    public String getMember_name() {
        return member_name;
    }

    public void setMember_name(String member_name) {
        this.member_name = member_name;
    }

    public String getLogTime() {
        return logTime;
    }

    public void setLogTime(String logTime) {
        this.logTime = logTime;
    }

    // Read the fields in the same order in which write() emits them
    @Override
    public void readFields(DataInput in) throws IOException {
        this.member_name = in.readUTF();
        this.logTime = in.readUTF();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(member_name);
        out.writeUTF(logTime);
    }

    // Order by member name first, then by login time. Comparing the name alone
    // would merge all of one member's login times into a single reduce group.
    @Override
    public int compareTo(MemberLogTime o) {
        int byName = this.getMember_name().compareTo(o.getMember_name());
        if (byName != 0) {
            return byName;
        }
        return this.getLogTime().compareTo(o.getLogTime());
    }

    @Override
    public String toString() {
        return this.member_name + "," + this.logTime;
    }
}
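The write/readFields pair can be exercised outside Hadoop, because DataOutputStream and DataInputStream implement the same DataOutput and DataInput interfaces. The following is a plain-Java sketch of that round trip; MemberLogTimeSketch is a hypothetical stand-in that does not implement Hadoop's WritableComparable, and its compareTo orders by name and then by login time as an assumption about the intended grouping.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a two-field Writable key, using only JDK classes.
class MemberLogTimeSketch implements Comparable<MemberLogTimeSketch> {
    String memberName = "";
    String logTime = "";

    MemberLogTimeSketch() {
    }

    MemberLogTimeSketch(String memberName, String logTime) {
        this.memberName = memberName;
        this.logTime = logTime;
    }

    // Serialize both fields; readFields must read them back in the same order.
    void write(DataOutput out) throws IOException {
        out.writeUTF(memberName);
        out.writeUTF(logTime);
    }

    void readFields(DataInput in) throws IOException {
        memberName = in.readUTF();
        logTime = in.readUTF();
    }

    // Compare by name first, then by login time.
    @Override
    public int compareTo(MemberLogTimeSketch o) {
        int byName = memberName.compareTo(o.memberName);
        return byName != 0 ? byName : logTime.compareTo(o.logTime);
    }

    // Serializes src to bytes and builds a fresh object from those bytes.
    static MemberLogTimeSketch roundTrip(MemberLogTimeSketch src) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                src.write(out);
            }
            MemberLogTimeSketch copy = new MemberLogTimeSketch();
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                copy.readFields(in);
            }
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        MemberLogTimeSketch original = new MemberLogTimeSketch("Aaron", "2016-01-04");
        MemberLogTimeSketch copy = roundTrip(original);
        System.out.println(copy.memberName + "," + copy.logTime);
    }
}
```

A copy rebuilt from the bytes compares equal to the original, while the same member on a different day compares as a different key.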

LogCountMapper implementation

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogCountMapper extends Mapper<Text, Text, MemberLogTime, IntWritable> {
    // Reused output objects, so that map() does not allocate per record
    private MemberLogTime mt = new MemberLogTime();
    private IntWritable one = new IntWritable(1);

    // Custom counters: number of input records per month
    enum LogCounter {
        January,
        February
    }

    @Override
    protected void map(Text key, Text value, Mapper<Text, Text, MemberLogTime, IntWritable>.Context context)
            throws IOException, InterruptedException {
        String member_name = key.toString();
        String logTime = value.toString();
        if (logTime.contains("2016-01")) {
            context.getCounter(LogCounter.January).increment(1);
        } else if (logTime.contains("2016-02")) {
            context.getCounter(LogCounter.February).increment(1);
        }
        mt.setMember_name(member_name);
        mt.setLogTime(logTime);
        context.write(mt, one);
    }
}
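The enum-based counters in the Mapper behave like a map from enum constant to a running total. The sketch below imitates that bookkeeping with an EnumMap; it is not Hadoop's Counter API, and the class name LogCounterSketch and the sample login times are hypothetical.

```java
import java.util.EnumMap;
import java.util.Map;

// Imitates per-month counters with an EnumMap; only JDK classes are used.
class LogCounterSketch {
    enum LogCounter { January, February }

    // Increments the matching counter once per login time.
    static Map<LogCounter, Long> count(String[] logTimes) {
        Map<LogCounter, Long> counters = new EnumMap<>(LogCounter.class);
        for (String logTime : logTimes) {
            if (logTime.contains("2016-01")) {
                counters.merge(LogCounter.January, 1L, Long::sum);
            } else if (logTime.contains("2016-02")) {
                counters.merge(LogCounter.February, 1L, Long::sum);
            }
        }
        return counters;
    }

    public static void main(String[] args) {
        // Hypothetical login times.
        String[] logTimes = { "2016-01-04", "2016-01-05", "2016-02-01" };
        System.out.println(count(logTimes));
    }
}
```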

LogCountCombiner implementation

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class LogCountCombiner extends Reducer<MemberLogTime, IntWritable, MemberLogTime, IntWritable> {
@Override
protected void reduce(MemberLogTime key, Iterable<IntWritable> value,
Reducer<MemberLogTime, IntWritable, MemberLogTime, IntWritable>.Context context)
throws IOException, InterruptedException {
int sum=0;
for (IntWritable val : value) {
sum+=val.get();
}
context.write(key, new IntWritable(sum));
}
}
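A Combiner that sums counts is safe because addition is associative: summing per-map partial sums gives the same total as summing every value at the Reducer. The plain-Java sketch below checks this on made-up data; the class name CombinerSumSketch and the value lists are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Compares reducing raw values with reducing combiner-produced partial sums.
class CombinerSumSketch {
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // Reducer input without a Combiner: every value from every map task.
    static int reduceWithoutCombiner(List<List<Integer>> perMapTask) {
        List<Integer> all = new ArrayList<>();
        for (List<Integer> values : perMapTask) {
            all.addAll(values);
        }
        return sum(all);
    }

    // Reducer input with a Combiner: one partial sum per map task.
    static int reduceWithCombiner(List<List<Integer>> perMapTask) {
        List<Integer> partialSums = new ArrayList<>();
        for (List<Integer> values : perMapTask) {
            partialSums.add(sum(values));
        }
        return sum(partialSums);
    }

    public static void main(String[] args) {
        // Hypothetical values emitted for one key by two map tasks.
        List<List<Integer>> perMapTask = Arrays.asList(
            Arrays.asList(1, 1, 1),
            Arrays.asList(1, 1));
        System.out.println(reduceWithoutCombiner(perMapTask));
        System.out.println(reduceWithCombiner(perMapTask));
    }
}
```

Both paths give the same total, but with the Combiner only two values cross the network instead of five.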

LogCountPartitioner implementation

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Partitioner;
public class LogCountPartitioner extends Partitioner<MemberLogTime, IntWritable> {
@Override
public int getPartition(MemberLogTime key, IntWritable value, int numPartitions) {
String date=key.getLogTime();
if(date.contains("2016-01")){
return 0%numPartitions;
}else{
return 1%numPartitions;
}
}
}
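The modulo in getPartition keeps the returned index valid even when the job runs with fewer than two reduce tasks. A plain-Java sketch of the same rule follows; PartitionSketch is a hypothetical name and the dates are sample values.

```java
// Same routing rule as the Partitioner, written without Hadoop classes.
class PartitionSketch {
    // January records go to partition 0, everything else to partition 1,
    // wrapped by numPartitions so the index is always in range.
    static int getPartition(String logTime, int numPartitions) {
        int wanted = logTime.contains("2016-01") ? 0 : 1;
        return wanted % numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(getPartition("2016-01-04", 2));
        System.out.println(getPartition("2016-02-10", 2));
        System.out.println(getPartition("2016-02-10", 1));
    }
}
```

With two reduce tasks the months land in separate output files; with a single reduce task both months fall into partition 0.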

LogCountReducer implementation

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;
public class LogCountReducer extends Reducer<MemberLogTime, IntWritable, MemberLogTime, IntWritable> {
@Override
protected void reduce(MemberLogTime key, Iterable<IntWritable> value,
Reducer<MemberLogTime, IntWritable, MemberLogTime, IntWritable>.Context context)
throws IOException, InterruptedException {
if(key.getLogTime().contains("2016-01")){
context.getCounter("OutputCounter","JanuaryResult").increment(1);
}else if(key.getLogTime().contains("2016-02")){
context.getCounter("OutputCounter", "FebruaryResult").increment(1);
}
int sum=0;
for (IntWritable val : value) {
sum+=val.get();
}
context.write(key, new IntWritable(sum));
}
}

LogCount driver class implementation

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileAsTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class LogCount {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://home.hddly.cn:9864");
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            // Default paths; change myname to the full pinyin of your own name
            otherArgs = new String[] { "/user/myname/output_SelectData/part-m-00000",
                    "/user/myname/output_MonthData" };
        }
        Job job = Job.getInstance(conf, "logcount");
        job.setJarByClass(LogCount.class);
        job.setMapperClass(LogCountMapper.class);
        job.setReducerClass(LogCountReducer.class);
        job.setCombinerClass(LogCountCombiner.class);
        job.setPartitionerClass(LogCountPartitioner.class);
        // Two reduce tasks: one output file per month
        job.setNumReduceTasks(2);

        job.setOutputKeyClass(MemberLogTime.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(SequenceFileAsTextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        // Delete any previous output directory so that the job can be rerun
        FileSystem.get(conf).delete(new Path(otherArgs[1]), true);
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        // Exit with 0 on success and 1 on failure
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Run LogCount -> main()

1. Check that the /user/myname/output_MonthData directory contains two result files

2. Check the contents of part-r-00000 and part-r-00001

For example: Aaron,2016-01-04 99

Right-click the MemberCount project -> Export -> Runnable JAR file ->

Launch configuration: select the LogCount configuration you just ran

Export destination: d:\soft\hadoop\LogCount.jar

Library handling: select "Extract required libraries into generated JAR"

If any prompt appears during the export, just click OK

Upload the JAR to master

From crt, open an sftp session to master

lcd d:\soft\hadoop\

cd /root/hadoop

put LogCount.jar

hadoop jar LogCount.jar /user/myname/output_SelectData/part-m-00000 /user/myname/output_MonthData

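Task 5.4 Submit the log-file statistics program from Eclipse

Run MR directly from Eclipse

Pass the delimiter in as a parameter to cope with changes in the source file format

Optimization: invoke the job through ToolRunner

Optimization: use a separate method to set the Hadoop cluster configuration

Optimization: add logging so that users can inspect the logs

Optimization: use the JarUtil helper class to package and run the JAR directly

In the main method, call ToolRunner's run(Configuration conf, Tool tool, String[] args) to run the application

MR implementation: log statistics by month, version 2

Goal: use MapReduce to count each user's logins per day in January and February 2016

This builds on the output of Task 5.1: hadoop jar SelectData.jar /user/myname/user_login.txt /user/myname/output_SelectData
Confirm that the following file on HDFS has content:
/user/myname/output_SelectData/part-m-00000

In the MemberCount project, create the package logcount; all the classes below are created in that package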

The custom type MemberLogTime and the LogCountMapper, LogCountCombiner, LogCountPartitioner and LogCountReducer classes are the same as in Task 5.3 and are reused unchanged.

LogCount driver class implementation

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileAsTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class LogCount extends Configured implements Tool {
    public static void main(String[] args) {
        // Input and output paths; change myname to your own name
        String[] myArgs = {
            "/user/myname/output_SelectData/part-m-00000",
            "/user/myname/output_logcount"
        };

        try {
            // ToolRunner hands the configuration to the Tool before calling run()
            int exitCode = ToolRunner.run(LogCount.getMyConfiguration(), new LogCount(), myArgs);
            System.exit(exitCode);
        } catch (Exception e) {
            e.printStackTrace();
            System.exit(1);
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        // Reuse the configuration passed in through ToolRunner
        Configuration conf = getConf();
        Job job = Job.getInstance(conf, "logcount");
        job.setJarByClass(LogCount.class);
        job.setMapperClass(LogCountMapper.class);
        job.setReducerClass(LogCountReducer.class);
        job.setCombinerClass(LogCountCombiner.class);
        job.setPartitionerClass(LogCountPartitioner.class);
        job.setNumReduceTasks(2);
        job.setOutputKeyClass(MemberLogTime.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(SequenceFileAsTextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Delete any previous output directory so that the job can be rerun
        FileSystem.get(conf).delete(new Path(args[1]), true);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Return 0 on success and 1 on failure
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static Configuration getMyConfiguration() {
        // Build the cluster configuration
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.app-submission.cross-platform", true);
        conf.set("fs.defaultFS", "hdfs://master:9864"); // NameNode
        conf.set("mapreduce.framework.name", "yarn"); // run on the YARN framework
        String resourcenode = "master";
        conf.set("yarn.resourcemanager.address", resourcenode + ":8032"); // ResourceManager
        conf.set("yarn.resourcemanager.scheduler.address", resourcenode + ":8030"); // resource scheduler
        conf.set("mapreduce.jobhistory.address", resourcenode + ":10020");
        // Package the compiled classes into a JAR and submit it with the job
        conf.set("mapreduce.job.jar", JarUtil.jar(LogCount.class));
        return conf;
    }
}

JarUtil class implementation

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

public class JarUtil {
    // Packages the project's bin/ directory into <class name>.jar and returns the JAR name (verified to work)
    public static String jar(Class<?> cls) {
        String outputJar = cls.getName() + ".jar";
        String input = cls.getClassLoader().getResource("").getFile();
        // Strip the trailing slash, go up one directory, then point at bin/
        input = input.substring(0, input.length() - 1);
        input = input.substring(0, input.lastIndexOf("/") + 1);
        input = input + "bin/";
        jar(input, outputJar);
        return outputJar;
    }

    private static void jar(String inputFileName, String outputFileName) {
        // try-with-resources closes the stream even if packaging fails
        try (JarOutputStream out = new JarOutputStream(new FileOutputStream(outputFileName))) {
            File f = new File(inputFileName);
            jar(out, f, "");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static void jar(JarOutputStream out, File f, String base) throws Exception {
        if (f.isDirectory()) {
            File[] fl = f.listFiles();
            base = base.length() == 0 ? "" : base + "/"; // note: JAR entry names use the forward slash
            for (int i = 0; i < fl.length; i++) {
                jar(out, fl[i], base + fl[i].getName());
            }
        } else {
            out.putNextEntry(new JarEntry(base));
            // Copy the file content into the current JAR entry
            try (FileInputStream in = new FileInputStream(f)) {
                byte[] buffer = new byte[1024];
                int n = in.read(buffer);
                while (n != -1) {
                    out.write(buffer, 0, n);
                    n = in.read(buffer);
                }
            }
        }
    }
}
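The packaging idea behind JarUtil can be tried in isolation with java.util.jar and java.nio.file. The sketch below is an alternative formulation rather than the course's JarUtil: it builds a temporary directory containing one placeholder file, packs it, and lists the resulting entry names. The names JarSketch, pack and demoEntryNames, and the file logcount/LogCount.class, are hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Packs a directory tree into a JAR and reads the entry names back.
class JarSketch {
    // Adds every regular file under root; entry names are relative paths using '/'.
    static void pack(Path root, Path jarFile) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        try (OutputStream fileOut = Files.newOutputStream(jarFile);
             JarOutputStream out = new JarOutputStream(fileOut)) {
            for (Path file : files) {
                String name = root.relativize(file).toString().replace('\\', '/');
                out.putNextEntry(new JarEntry(name));
                Files.copy(file, out);
                out.closeEntry();
            }
        }
    }

    // Builds a throwaway directory with one placeholder file, packs it, lists the entries.
    static List<String> demoEntryNames() {
        try {
            Path root = Files.createTempDirectory("classes");
            Path pkg = Files.createDirectories(root.resolve("logcount"));
            Files.write(pkg.resolve("LogCount.class"), new byte[] { 1, 2, 3 });
            Path jar = Files.createTempFile("sketch", ".jar");
            pack(root, jar);
            List<String> names = new ArrayList<>();
            try (JarFile jarFile = new JarFile(jar.toFile())) {
                jarFile.stream().forEach(entry -> names.add(entry.getName()));
            }
            Collections.sort(names);
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demoEntryNames());
    }
}
```

The entry name keeps the package directory and uses the forward slash on every platform, which is what the class loader expects inside a JAR.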

Note: place the configuration file in the src directory

log4j.properties

log4j.rootLogger = INFO,stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target=System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy/MM/dd HH:mm:ss,SSS}- %c{1}: %m%n

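Introduction to Hadoop YARN resource allocation and commonly recommended tuning parameters

Reference: https://www.freesion.com/article/6381814121/

How an application runs

Step 1: The user submits the application to the ResourceManager.

Step 2: The ResourceManager requests resources for the application's ApplicationMaster and communicates with a NodeManager to start the ApplicationMaster.

Step 3: The ApplicationMaster communicates with the ResourceManager to request resources for the tasks it needs to run. Once it has the resources, it communicates with the NodeManager to start the corresponding tasks.

Step 4: After all tasks have finished, the ApplicationMaster unregisters from the ResourceManager and the application ends.

Steps 2 and 3 involve requesting and using resources, and this is where the Container comes in.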

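Understanding the Container

What is a Container

(1) A Container is YARN's abstraction of resources. It encapsulates a certain amount of resources (CPU and memory) on one node. It has nothing to do with Linux containers; it is purely a YARN concept (in implementation terms, it can be viewed as a serializable/deserializable Java class).

(2) Containers are requested from the ResourceManager by the ApplicationMaster and are allocated to the ApplicationMaster asynchronously by the resource scheduler inside the ResourceManager.

(3) A Container is launched by the ApplicationMaster on the NodeManager that holds the resources. Launching it requires the command for the task to run inside it (any command will do, such as one that starts a Java, Python, or C++ process), together with the environment variables and external resources the command needs (dictionary files, executables, JAR files, and so on).

Containers fall into two categories

(1) The Container that runs the ApplicationMaster: the ResourceManager requests it (from its internal resource scheduler) and starts it. When submitting an application, the user can specify the resources needed by its single ApplicationMaster.

(2) Containers that run the various tasks: the ApplicationMaster requests them from the ResourceManager and communicates with the NodeManager to start them.

Configuring Hadoop to run on hosts with little memory

Reference: https://blog.csdn.net/skyupward/article/details/103641962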

How to sort keys or values in descending order in MapReduce

Define a comparator that sorts in descending order

public static class IntWritableDecreasingComparator extends
IntWritable.Comparator {
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
return -super.compare(b1, s1, l1, b2, s2, l2);
}
}

In the main function, configure the sort phase to use the comparator defined above

// Use the custom comparator in the sort phase
job.setSortComparatorClass(IntWritableDecreasingComparator.class);
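The comparator above reverses the order by negating the result of the ascending comparison. The same trick can be seen with a plain java.util.Comparator; this sketch uses no Hadoop classes, and DescendingSortSketch and the sample keys are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

// Reverses an ascending integer ordering by negating the comparison result.
class DescendingSortSketch {
    static final Comparator<Integer> ASCENDING = Integer::compare;
    static final Comparator<Integer> DESCENDING = (a, b) -> -ASCENDING.compare(a, b);

    // Returns a sorted copy and leaves the input untouched.
    static Integer[] sortDescending(Integer... keys) {
        Integer[] copy = keys.clone();
        Arrays.sort(copy, DESCENDING);
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sortDescending(3, 1, 2)));
    }
}
```

In a job this only has the intended effect when the map output key is an IntWritable, since that is the type IntWritable.Comparator decodes.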

How Many Maps? (from the official documentation)

The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.

The right level of parallelism for maps seems to be around 10-100 maps per-node, although it has been set up to 300 maps for very cpu-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.

Thus, if you expect 10TB of input data and have a blocksize of 128MB, you’ll end up with 82,000 maps, unless Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a hint to the framework) is used to set it even higher.
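The figure in the quoted passage follows from dividing the input size by the block size. A small sketch of that arithmetic, assuming binary units (1 TB = 2^40 bytes, 1 MB = 2^20 bytes) and one map per block; the class name MapCountSketch is hypothetical.

```java
// Estimates the number of map tasks as the number of input blocks.
class MapCountSketch {
    // Ceiling division: a partially filled last block still needs its own map.
    static long estimateMaps(long inputBytes, long blockBytes) {
        return (inputBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long tenTb = 10L * 1024 * 1024 * 1024 * 1024; // 10 TB of input
        long blockSize = 128L * 1024 * 1024;          // 128 MB block size
        System.out.println(estimateMaps(tenTb, blockSize));
    }
}
```

The exact quotient is 81,920, which the documentation rounds to 82,000.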

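Lab 1: Find the highest and lowest temperature worldwide for each year and month

Master the use of the Combiner in MapReduce programming

Master custom data types

Master custom counters

Master passing parameters to MapReduce

Master the use of ToolRunner and submitting MapReduce jobs from Eclipse

Master the use of the Combiner

Master custom data types

Obtain global climate data from ncdc.noaa.gov, process it into a data.txt file, upload the file to HDFS, and then find the highest and lowest temperature for each year

Approach and steps

Go to the /root/hadoop directory on the server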

cd /root/hadoop

Get the experiment data (the same file is offered at several addresses; use whichever one is reachable)

wget http://10.255.10.50/b37066/file/temp.tar

wget http://bigdata.hddly.cn/b37066/file/temp.tar

wget http://home.hddly.cn:50091/file/temp.tar

Upload the data to HDFS

tar -xvf ./temp.tar

hdfs dfs -mkdir -p /user/myname/temp

hdfs dfs -put ./temp2021.txt /user/myname/temp

Verify that the data was uploaded

hdfs dfs -ls /user/myname/temp/

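Define a custom data type YearMaxTAndMinT with a String field year and double fields maxTemp and minTemp

Create MaxTAndMinTMapper, which extracts the year and the temperature and outputs the year-month as the key and the temperature as the value

Create MaxTAndMinTCombiner, which finds the highest and lowest temperature for each key locally and outputs them with the same key and the temperature as the value

Create MaxTAndMinTReducer, which finds the highest and lowest temperature for each year-month, stores them in a YearMaxTAndMinT object, and outputs that object as the value with NullWritable.get() as the key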
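The aggregation that the Combiner and Reducer perform can be sketched in plain Java. The record layout (a year-month string followed by a temperature) and the sample values below are assumptions made for illustration, since the actual format of temp2021.txt is not shown here; MaxMinTempSketch is a hypothetical name.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Tracks the highest and lowest temperature per key, as the Combiner and Reducer would.
class MaxMinTempSketch {
    // Each record is {key, temperature}; the result maps key -> {max, min}.
    static Map<String, double[]> aggregate(String[][] records) {
        Map<String, double[]> result = new TreeMap<>();
        for (String[] record : records) {
            double temp = Double.parseDouble(record[1]);
            double[] maxMin = result.get(record[0]);
            if (maxMin == null) {
                result.put(record[0], new double[] { temp, temp });
            } else {
                maxMin[0] = Math.max(maxMin[0], temp);
                maxMin[1] = Math.min(maxMin[1], temp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up records in the assumed {year-month, temperature} layout.
        String[][] records = {
            { "2021-01", "3.5" },
            { "2021-01", "-2.0" },
            { "2021-02", "7.0" }
        };
        for (Map.Entry<String, double[]> e : aggregate(records).entrySet()) {
            System.out.println(e.getKey() + " " + Arrays.toString(e.getValue()));
        }
    }
}
```

Maximum and minimum are associative like a sum, so applying the same step first in a Combiner and again in the Reducer does not change the result.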

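Reference source code

http://bigdata.hddly.cn/b37066/file/chap5_tempcount.rar

After debugging succeeds, export the JAR as tempcount.jar, upload it to the Hadoop cluster, and run hadoop jar ./tempcount.jar

Check the output: look at the results under the HDFS directory /user/myname/output_tempcount

1. Environment: group master host: , group member hosts: , this member's host:

2. On http://master:9870, take a screenshot of the files uploaded under this member's directory /user/myname in the group's cluster; it must include the temp directory and its file

3. In Eclipse, take screenshots of the source code of the map class, the reduce class, the main method, and so on

4. In Eclipse, run the program and take a screenshot of the console output

5. On this member's Linux virtual machine in the cluster, run tempcount.jar and take a screenshot

6. In the file system view on http://master:9870, open the files under the output directory /user/myname/output_tempcount/ and take a screenshot of their content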

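Lab 2: Select the records with temperatures between 15 and 25 °C

Master the use of the Combiner in MapReduce programming

Master custom data types

Master custom counters

Master passing parameters to MapReduce

Master the use of ToolRunner and submitting MapReduce jobs from Eclipse

Master the use of the Combiner

Master custom data types

Obtain global climate data from ncdc.noaa.gov, process it into a data.txt file, upload the file to HDFS, and then select the records whose temperature is between 15 and 25 °C

Approach and steps

Go to the /root/hadoop directory on the server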

cd /root/hadoop

Get the experiment data (the same file is offered at several addresses; use whichever one is reachable)

wget http://10.255.10.50/b37066/file/temp.tar

wget http://bigdata.hddly.cn/b37066/file/temp.tar

Upload the data to HDFS

tar -xvf ./temp.tar

hdfs dfs -mkdir -p /user/myname/temp

hdfs dfs -put ./temp2021.txt /user/myname/temp

Verify that the data was uploaded

hdfs dfs -ls /user/myname/temp/

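Create TempSelectMapper, which filters the temperature records and outputs each selected record as the value with NullWritable as the key

Create TempSelectRun, which implements Tool, sets the job parameters, and is invoked through ToolRunner's run method

Reference source code

http://bigdata.hddly.cn/b37066/file/chap5_tempselect.rar

After debugging succeeds, export the JAR as tempselect.jar, upload it to the Hadoop cluster, and run hadoop jar ./tempselect.jar

Check the output: look at the results under the HDFS directory /user/myname/output_tempselectrun

1. Environment: group master host: , group member hosts: , this member's host:

2. On http://master:9870, take a screenshot of the files uploaded under this member's directory /user/myname in the group's cluster; it must include the temp directory and its file

3. In Eclipse, take screenshots of the source code of the map class and the main method

4. In Eclipse, run the program and take a screenshot of the console output

5. On this member's Linux virtual machine in the cluster, run tempselect.jar and take a screenshot

6. In the file system view on http://master:9870, open the files under the output directory /user/myname/output_tempselectrun/ and take a screenshot of their content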

hide

综合实训5Linux服务安全监测系统

hide

leaf

当前各企业面临挖矿病毒的威胁,病毒占用大量的系统资源和带宽资源，严重影响正常的业务,现已波及众多的阿里云主机,国家层面已开始集中整治

leaf

急需一套系统能够快速巡检所有Linux系统,及时发现挖矿病毒,及时通报并给出处理建议

hide

实现思路及步骤

leaf

获取Linux日志相关数据

leaf

上传数据到hdfs

leaf

编写Java代码实现日分析

leaf

缩写Python代码实现可视化

leaf

检测到病毒后处理

leaf

leaf

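Project publicity and promotion

hide

leaf

Project participants may submit a report on this project as the comprehensive experiment for this course

leaf

Project participants earn an additional 1 to 10 points toward the course grade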

hide

hide

hide

Cryptomining malware zz.sh

leaf

https://blog.csdn.net/m0_37313888/article/details/82869939


hide

hide

SequenceFile.Reader constructor is deprecated

leaf

// Build an Option instance (the current API)
SequenceFile.Reader.Option pathOption = SequenceFile.Reader.file(new Path("/user/root/JanFeb/part-m-00000"));
// Create the Reader instance
SequenceFile.Reader reader1 = new SequenceFile.Reader(conf, pathOption);

leaf

// Get the file system
// FileSystem fs=FileSystem.get(conf);
// Create the SequenceFile.Reader (deprecated): the old API requires a FileSystem instance to open the file for reading
// SequenceFile.Reader reader=new SequenceFile.Reader(fs, new Path("/user/root/JanFeb/part-m-00000"), conf);

leaf

Reference: https://blog.csdn.net/mrkkmrkkk/article/details/108449363

leaf

Running Listing 5-29 throws an exception:
java.lang.ClassNotFoundException: org.eclipse.jetty.websocket.api.WebSocketException

hide

Cannot allocate containers as requested resource is greater than maximum allowed allocation

hide

Running LogCountRun in Eclipse reports an error

leaf

Invalid resource request! Cannot allocate containers as requested resource is greater than maximum allowed allocation. Requested resource type=[memory-mb], Requested resource=<memory:1536, vCores:1>, maximum allowed allocation=<memory:1024, vCores:2>, please note that maximum allowed allocation is calculated by scheduler based on maximum resource of registered NodeManagers, which might be less than configured maximum allocation=<memory:1024, vCores:4>

leaf

http://i.hddly.cn/media/eclipse_s5r6XoqmsG.png


hide

leaf

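Cause: two settings in yarn-site.xml are too small to satisfy the job's request (the job asks for 1536 MB, but the maximum allowed allocation is 1024 MB)
Increase both settings in yarn-site.xml:
Memory available to containers on a NodeManager: yarn.nodemanager.resource.memory-mb
Maximum memory per container: yarn.scheduler.maximum-allocation-mb

hide

Cause 1: yarn-site.xml is configured as follows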

leaf

<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>1024</value>
</property>

leaf

<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>1537</value>
</property>
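
The arithmetic behind this error can be checked by hand. As the error message itself notes, the maximum allowed allocation is calculated from the resources of the registered NodeManagers and may be lower than the configured maximum, so in effect it is the smaller of yarn.scheduler.maximum-allocation-mb and yarn.nodemanager.resource.memory-mb. The sketch below models only that comparison, for memory, on a cluster where every NodeManager has the same memory; the class and method names are hypothetical, and the 2048 MB figure is just an example value, not a recommendation from the source.

```java
public class YarnMemoryCheck {
    // Effective per-container memory cap: the configured scheduler maximum,
    // limited by what the largest registered NodeManager can offer.
    public static int effectiveMaxMb(int schedulerMaxMb, int nodeManagerMb) {
        return Math.min(schedulerMaxMb, nodeManagerMb);
    }

    // True when a container request of requestedMb can be granted.
    public static boolean fits(int requestedMb, int schedulerMaxMb, int nodeManagerMb) {
        return requestedMb <= effectiveMaxMb(schedulerMaxMb, nodeManagerMb);
    }

    public static void main(String[] args) {
        int requested = 1536; // the request size reported in the error message

        // Values from the yarn-site.xml above: the NodeManager memory (1024) is the bottleneck.
        System.out.println(fits(requested, 1537, 1024)); // prints false

        // Hypothetical example after raising both settings to 2048.
        System.out.println(fits(requested, 2048, 2048)); // prints true
    }
}
```

Raising only yarn.scheduler.maximum-allocation-mb is therefore not enough: yarn.nodemanager.resource.memory-mb must also be at least as large as the request.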

hide

Could not find or load main class
org.apache.hadoop.mapreduce.v2.app.MRAppMaster

hide

Running the job again reports the error "Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster"

hide

leaf

http://i.hddly.cn/media/eclipse_s1m4wReCmH.png


hide

leaf

https://blog.csdn.net/qq_41684957/article/details/81710190?spm=1001.2101.3001.6661.1

hide

leaf

Please check whether your <HADOOP_HOME>/etc/hadoop/mapred-site.xml contains the below configuration:
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
</property>

hide

hide

Use hadoop classpath to find the path

leaf

[root@master hadoop]# hadoop classpath
/usr/local/hadoop-3.3.1/etc/hadoop:/usr/local/hadoop-3.3.1/share/hadoop/common/lib/*:/usr/local/hadoop-3.3.1/share/hadoop/common/*:/usr/local/hadoop-3.3.1/share/hadoop/hdfs:/usr/local/hadoop-3.3.1/share/hadoop/hdfs/lib/*:/usr/local/hadoop-3.3.1/share/hadoop/hdfs/*:/usr/local/hadoop-3.3.1/share/hadoop/mapreduce/*:/usr/local/hadoop-3.3.1/share/hadoop/yarn:/usr/local/hadoop-3.3.1/share/hadoop/yarn/lib/*:/usr/local/hadoop-3.3.1/share/hadoop/yarn/*
[root@master hadoop]#

hide

Edit yarn-site.xml: vi yarn-site.xml

leaf

<configuration>
<property>
<name>yarn.application.classpath</name>
<value>paste the path returned by hadoop classpath above</value>
</property>
</configuration>

hide


Edit mapred-site.xml: vi mapred-site.xml

leaf

<property>
<name>mapreduce.application.classpath</name>
<value>paste the path returned by hadoop classpath above</value>
</property>

leaf

Apply the settings above on all master and slave nodes, then restart the Hadoop cluster and rerun the MapReduce program; it now runs successfully

hide

org.apache.hadoop.mapreduce.v2.app.MRAppMaster:
Error starting MRAppMaster

hide

leaf

Error log: 2023-05-23 01:18:44,500 ERROR [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster java.lang.UnsupportedClassVersionError: train3_musiccount/S1_MusicSelectData$SelectDataMapper has been compiled by a more recent version of the Java Runtime (class file version 61.0), this version of the Java Runtime only recognizes class file versions up to 52.0 at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:756) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)

leaf

Cause: class file version 61.0 means the classes were compiled in Eclipse for Java 17, whereas the cluster's runtime only accepts class files up to version 52.0, that is, the Java 1.8 we need

leaf

Fix: in Eclipse, open the project properties -> Java Compiler, change the compliance level from Java 17 to Java 1.8, then recompile, repackage, and redeploy
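
To confirm which Java release a compiled class targets before deploying the jar, one can read the class-file header directly: it begins with the magic number 0xCAFEBABE, followed by a 2-byte minor version and a 2-byte major version, and the major version minus 44 gives the Java release (52 is Java 8, 61 is Java 17). The following is a minimal sketch of that check; the class name ClassVersion and its methods are hypothetical helpers, not part of Hadoop or Eclipse.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ClassVersion {
    // Class-file major version 52 corresponds to Java 8, 61 to Java 17, and so on.
    public static int javaRelease(int majorVersion) {
        return majorVersion - 44;
    }

    // Extracts the major version from the first 8 bytes of a .class file.
    public static int majorVersion(byte[] classBytes) {
        if (classBytes.length < 8) {
            throw new IllegalArgumentException("too short to be a class file");
        }
        ByteBuffer buf = ByteBuffer.wrap(classBytes); // big-endian by default
        if (buf.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("missing 0xCAFEBABE magic number");
        }
        buf.getShort();                 // minor version, not needed here
        return buf.getShort() & 0xFFFF; // major version as an unsigned value
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ClassVersion <path to .class file>");
            return;
        }
        int major = majorVersion(Files.readAllBytes(Paths.get(args[0])));
        System.out.println("class file version " + major + " -> Java " + javaRelease(major));
    }
}
```

Running it against a class extracted from the exported jar shows whether the jar needs to be rebuilt with the 1.8 compliance level.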

leaf

Not enough documents for more than one split! Consider setting mongo.input.split_size to a lower value.

hide

Querying data from a MongoDB database reports an error:

leaf

Check the log: /usr/local/hadoop-3.3.1/logs/userlogs/application_1649938751428_0002/container_1649938751428_0002_01_000402/syslog

leaf

2022-04-14 21:03:08,216 INFO [main] org.mongodb.driver.connection: Opened connection [connectionId{localValue:2, serverValue:877}] to home.hddly.cn:57017
2022-04-14 21:03:09,770 ERROR [main] com.mongodb.hadoop.input.MongoRecordReader: Exception reading next key/val from mongo: Query failed with error code 51173 and error message 'error processing query: ns=pythondb.news_dataTree: $and
Sort: {}
Proj: {}
planner returned error :: caused by :: When using min()/max() a hint of which index to use must be provided' on server home.hddly.cn:57017

hide

leaf

https://blog.csdn.net/u011007180/article/details/53233300


leaf

Exception reading next key/val from mongo: Query failed with error code 51173

hide

Running LogCountRun on YARN reports an SLF4J error

hide

leaf

[2023-03-30 07:34:40.108]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

hide

Running LogCountRun on YARN reports a prelaunch.err error

hide

leaf

[2023-03-30 07:34:40.108]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

hide

leaf

Open the job in the YARN web UI at master:8088

leaf

Go to the job's detail page and follow the logs link

leaf

The error message found there: 2023-03-30 10:41:22,178 ERROR [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster java.lang.UnsupportedClassVersionError: chap5_logcount/LogCountMapper has been compiled by a more recent version of the Java Runtime (class file version 61.0), this version of the Java Runtime only recognizes class file versions up to 52.0 at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:756)

leaf

The cause turns out to be that the JDK version used in Eclipse is too high

hide

leaf

In Eclipse: Window -> Preferences -> Java -> Compiler, set Compiler compliance level to 1.8

leaf

Rerun the job; it succeeds

hide

Other common Hadoop problems

leaf

https://www.cnblogs.com/yinzhengjie/p/13766307.html

hide

hide

Ver1.1-20220121

leaf

hide

Ver1.2-20230523

leaf

Added common problems caused by the Java version not being set to 1.8