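Implementing MapReduce in Java on top of the file system (and MySQL)
Why I wrote this code:
- I know MapReduce, but so far I have only used it on AWS EMR. I have set up a pseudo-distributed cluster myself, but it felt hard to operate and maintain.
- MySQL is the only database I know reasonably well (I wanted to use MongoDB, but I don't really know it).
- The data volume is not very large, at least for me.
- I want to avoid things going wrong, and in that respect the file system can be trusted.
The design is as follows: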
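- init phase: add every required input file to a list file, input_file_list.txt.
- Map phase: read every line of every file listed in input_file_list.txt and map it to a key-value pair. Because a key may contain special characters, MySQL stores the id-to-key mapping, and the values for a key are appended to a temporary file named after that id.
- Reduce phase: for each key, read the corresponding file and produce a list of name-value pairs. Each name-value list corresponds to one JSON object, for example { "name": "zifeiy", "age": 88 }. All JSON objects are stored in a result file, reduceResult.txt (the code below names this file reduce.txt).
- Result processing phase: parse the reduceResult.txt file and generate the final CSV or Excel file.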
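The map and reduce phases described above can be tried out without a database. The following is only a minimal sketch under stated assumptions, not the author's implementation: the class name `FileMapReduceSketch`, the in-memory `keyIds` map (a stand-in for the MySQL id-to-key table), the choice of the first comma-separated field as key, and the "count values per key" reduce step are all hypothetical and chosen purely for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FileMapReduceSketch {
    // Stand-in for the MySQL id-to-key table (assumption): every key gets a
    // sequential id, so file names never contain characters from the key.
    private final Map<String, Integer> keyIds = new LinkedHashMap<>();
    private final Path tmpDir;

    public FileMapReduceSketch(Path tmpDir) {
        this.tmpDir = tmpDir;
    }

    // Map phase: append the value to the file that belongs to the key's id.
    public void emit(String key, String value) throws IOException {
        Integer id = keyIds.get(key);
        if (id == null) {
            id = keyIds.size() + 1;
            keyIds.put(key, id);
        }
        Files.write(tmpDir.resolve(id + ".txt"),
                (value + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Reduce phase: for every key, read back all of its values and reduce them.
    // Here the reduction simply counts the values.
    public Map<String, Integer> countPerKey() throws IOException {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : keyIds.entrySet()) {
            List<String> values = Files.readAllLines(
                    tmpDir.resolve(entry.getValue() + ".txt"), StandardCharsets.UTF_8);
            counts.put(entry.getKey(), values.size());
        }
        return counts;
    }

    // Runs map and reduce over in-memory lines, using a fresh temp directory.
    public static Map<String, Integer> run(List<String> lines) throws IOException {
        FileMapReduceSketch job =
                new FileMapReduceSketch(Files.createTempDirectory("mr-sketch"));
        for (String line : lines) {
            String key = line.split(",", 2)[0]; // first CSV field is the key
            job.emit(key, line);
        }
        return job.countPerKey();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(run(Arrays.asList("a,1", "b,2", "a,3"))); // prints {a=2, b=1}
    }
}
```

Keeping the values on disk rather than in memory is what lets this approach handle inputs larger than the heap; only the key table has to fit in memory (or, as in the design above, in a database table).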
Main code:
package com.zifeiy.snowflake.tools.mapreduce.v1;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import com.google.gson.Gson;
import com.zifeiy.snowflake.assist.CsvOneLineParser;
import com.zifeiy.snowflake.assist.FileHelper;
import jxl.Workbook;
import jxl.write.Label;
import jxl.write.WritableSheet;
import jxl.write.WritableWorkbook;
public abstract class MapReduceBaseVersion1 {

    // Extra JDBC URL parameters for MySQL Connector/J.
    private static final String APPENDED_DB_INFO = "?useUnicode=true&characterEncoding=UTF8"
            + "&rewriteBatchedStatements=true"
            + "&useLegacyDatetimeCode=false"
            + "&serverTimezone=Asia/Shanghai"
            + "&useSSL=false";
    private static final String classname = "com.mysql.cj.jdbc.Driver";
    private static final String url = "jdbc:mysql://localhost:3306/snowflake" + APPENDED_DB_INFO;
    private static final String username = "root";
    private static final String password = "password";

    // Every task works inside its own directory: <taskRootPath>/<taskId>
    public static final String taskRootPath = "D:\\snowflake\\task";

    private Connection connection = null;
    private File inputListFile = null;    // input_file_list.txt: one input path per line
    private File reduceResultFile = null; // reduce output: one JSON object per line
    private File resultFile = null;       // final CSV or Excel file
    private int taskId;

    // Called from init(): registers one input file by appending its path to the list file.
    public void addInputPath(File file) throws IOException {
        FileHelper.appendFile(inputListFile, file.getAbsolutePath() + "\r\n");
    }

    // Called from map(): appends value to the temp file that belongs to key.
    public void setKeyValuePair(String key, String value) throws Exception {
        int id = -1;
        String table = "tmp" + taskId;
        // Parameterized queries handle keys that contain quotes or other special characters.
        try (PreparedStatement select = connection.prepareStatement(
                "select id from " + table + " where kname = ?")) {
            select.setString(1, key);
            try (ResultSet resultSet = select.executeQuery()) {
                if (resultSet.next()) {
                    id = resultSet.getInt(1);
                }
            }
        }
        if (id == -1) {
            // First time this key is seen: insert it and read back the generated id.
            try (PreparedStatement insert = connection.prepareStatement(
                    "insert into " + table + " (kname) values (?)", Statement.RETURN_GENERATED_KEYS)) {
                insert.setString(1, key);
                insert.executeUpdate();
                try (ResultSet keys = insert.getGeneratedKeys()) {
                    if (keys.next()) {
                        id = keys.getInt(1);
                    }
                }
            }
        }
        if (id == -1) throw new Exception("set key value pair failed: key = " + key + ", value = " + value);
        File tmpFile = new File(taskRootPath + File.separator + taskId + File.separator + "tmp" + File.separator + id + ".txt");
        if (!tmpFile.exists()) {
            tmpFile.createNewFile();
        }
        FileHelper.appendFile(tmpFile, value + "\r\n");
    }

    // Called from reduce(): writes each name-value map as one JSON line to the reduce result file.
    public void addParamList(List<Map<String, String>> paramList) throws Exception {
        StringBuilder content = new StringBuilder();
        Gson gson = new Gson();
        for (Map<String, String> params : paramList) {
            content.append(gson.toJson(params)).append("\r\n");
        }
        FileHelper.appendFile(reduceResultFile, content.toString());
    }

    // Called from generate(): converts the reduce output into the final result file.
    // columns are the JSON field names to export; nameColumns are the header labels.
    public void generateFile(String[] columns, String[] nameColumns) throws Exception {
        if (reduceResultFile == null || !reduceResultFile.exists()) {
            throw new Exception("[mapreduce.v1] in generateFile function: reduceResultFile does not exist!");
        }
        Gson gson = new Gson();
        if (reduceResultFile.length() > 1 * 1024 * 1024) { // larger than 1 MB: export as CSV
            resultFile = new File(taskRootPath + File.separator + taskId + File.separator + "result.csv");
            try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(reduceResultFile), "UTF-8"));
                 OutputStreamWriter osw = new OutputStreamWriter(new FileOutputStream(resultFile), "UTF-8")) {
                // header row
                StringBuilder content = new StringBuilder();
                for (int i = 0; i < nameColumns.length; i++) {
                    if (i > 0) content.append(',');
                    content.append('"').append(nameColumns[i].replace("\"", "\"\"")).append('"');
                }
                osw.write(content + "\r\n");
                // one CSV row per JSON line
                String line;
                while ((line = br.readLine()) != null) {
                    content.setLength(0);
                    Map<String, String> map = gson.fromJson(line, Map.class);
                    if (map == null) { throw new Exception("map is null by parsing line: " + line); }
                    for (int i = 0; i < columns.length; i++) {
                        if (i > 0) content.append(',');
                        String v = map.get(columns[i]);
                        if (v != null) {
                            // double embedded quotes so the field stays valid CSV
                            content.append('"').append(v.replace("\"", "\"\"")).append('"');
                        }
                    }
                    osw.write(content + "\r\n");
                }
            }
        } else { // 1 MB or smaller: export as an Excel file
            resultFile = new File(taskRootPath + File.separator + taskId + File.separator + "result.xls");
            WritableWorkbook workbook = Workbook.createWorkbook(resultFile);
            WritableSheet sheet = workbook.createSheet("sheet1", 0);
            for (int i = 0; i < nameColumns.length; i++) {
                sheet.addCell(new Label(i, 0, nameColumns[i])); // header row
            }
            try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(reduceResultFile), "UTF-8"))) {
                String line;
                int rowId = 1;
                while ((line = br.readLine()) != null) {
                    Map<String, String> map = gson.fromJson(line, Map.class);
                    if (map == null) { throw new Exception("map is null by parsing line: " + line); }
                    for (int i = 0; i < columns.length; i++) {
                        String v = map.get(columns[i]);
                        // missing fields become empty cells
                        sheet.addCell(new Label(i, rowId, v != null ? v : ""));
                    }
                    rowId++;
                }
            }
            workbook.write();
            workbook.close();
        }
    }
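
    // The four hooks a concrete job implements:
    public abstract void init() throws Exception;       // register input files via addInputPath
    public abstract void map(String line) throws Exception;       // emit pairs via setKeyValuePair
    public abstract void reduce(String key, ReduceReader reduceReader) throws Exception; // emit rows via addParamList
    public abstract void generate() throws Exception;   // build the result via generateFile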

    // Runs the four phases in order. Returns the absolute path of the result file,
    // or null if the job failed.
    public String mapreduce() {
        try {
            Class.forName(classname);
            connection = DriverManager.getConnection(url, username, password);

            // generate taskId from the auto-increment key of the task table
            try (PreparedStatement preparedStatement = connection.prepareStatement(
                    "insert into task () values ()", Statement.RETURN_GENERATED_KEYS)) {
                preparedStatement.executeUpdate();
                try (ResultSet generatedKeys = preparedStatement.getGeneratedKeys()) {
                    if (generatedKeys.next()) {
                        taskId = generatedKeys.getInt(1);
                    } else {
                        throw new Exception("[mapreduce.v1] Exception: can not generate taskId");
                    }
                }
            }

            // create the task directory and its tmp subdirectory
            String taskPath = taskRootPath + File.separator + taskId;
            File taskPathDir = new File(taskPath);
            if (taskPathDir.exists()) {
                throw new Exception("[mapreduce.v1] Exception: task directory already exists");
            }
            taskPathDir.mkdirs();
            String tmpDirPath = taskPath + File.separator + "tmp";
            File tmpDir = new File(tmpDirPath);
            tmpDir.mkdirs();
            this.inputListFile = new File(taskPath + File.separator + "input_file_list.txt");
            inputListFile.createNewFile();

            // phase 1: init()
            // during init, addInputPath is used to register all the input files we need
            init();

            // phase 2: map(line)
            // the temporary table holds the id-to-key mapping for this task
            Statement statement = connection.createStatement();
            statement.execute("create temporary table tmp" + taskId + " ( id int not null auto_increment primary key, kname varchar(200) )");
            try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(inputListFile), "UTF-8"))) {
                String inputFilename;
                while ((inputFilename = br.readLine()) != null) {
                    File inputFile = new File(inputFilename);
                    if (!inputFile.exists()) {
                        throw new Exception("[mapreduce.v1] Exception: input file " + inputFilename + " does not exist!");
                    }
                    // input files are read as GBK; each line is handed to map()
                    try (BufferedReader br2 = new BufferedReader(new InputStreamReader(new FileInputStream(inputFile), "GBK"))) {
                        String line;
                        while ((line = br2.readLine()) != null) {
                            map(line);
                        }
                    }
                }
            }

            // phase 3: reduce(key, reduceReader)
            reduceResultFile = new File(taskPath + File.separator + "reduce.txt");
            if (reduceResultFile.exists()) {
                throw new Exception("[mapreduce.v1] reduce file already exists!");
            }
            reduceResultFile.createNewFile();
            ResultSet resultSet = statement.executeQuery("select id, kname from tmp" + taskId);
            while (resultSet.next()) {
                int id = resultSet.getInt(1);
                String key = resultSet.getString(2);
                File reduceFile = new File(tmpDirPath + File.separator + id + ".txt");
                if (!reduceFile.exists()) {
                    throw new Exception("[mapreduce.v1] Exception: reduce file " + reduceFile.getName() + " does not exist!");
                }
                ReduceReader reduceReader = new ReduceReader(reduceFile);
                reduce(key, reduceReader);
            }

            // phase 4: generate the result file
            generate();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // closing the connection also drops the temporary table
            if (connection != null) {
                try { connection.close(); } catch (Exception ignored) { }
            }
        }
        return resultFile == null ? null : resultFile.getAbsolutePath();
    }

    // main for test
    public static void main(String[] args) {
        MapReduceBaseVersion1 mapReduceBaseVersion1 = new MapReduceBaseVersion1() {

            @Override
            public void init() throws Exception {
                addInputPath(new File("D:\\test\\test.del"));
            }

            @Override
            public void map(String line) throws Exception {
                // characters 1 and 2 of the line serve as the key; lines shorter
                // than three characters make substring throw
                setKeyValuePair(line.substring(1, 3), line);
            }

            @Override
            public void reduce(String key, ReduceReader reduceReader) throws Exception {
                List<Map<String, String>> paramList = new ArrayList<Map<String, String>>();
                String line;
                while ((line = reduceReader.next()) != null) {
                    // name the CSV fields of each line "1", "2", "3", ...
                    List<String> rowList = CsvOneLineParser.parseLine(line);
                    Map<String, String> tmpMap = new HashMap<String, String>();
                    int idx = 0;
                    for (String s : rowList) {
                        idx++;
                        tmpMap.put("" + idx, s);
                    }
                    paramList.add(tmpMap);
                }
                addParamList(paramList);
            }

            @Override
            public void generate() throws Exception {
                // export fields "1".."6"; the header labels are the Chinese numerals one to six
                generateFile(new String[] { "1", "2", "3", "4", "5", "6" }, new String[] { "一", "二", "三", "四", "五", "六" });
            }
        };
        System.out.println(mapReduceBaseVersion1.mapreduce());
    }
}
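The code above relies on two helpers that the post does not show: `ReduceReader` (constructed from a `File`, with a `next()` method that returns `null` when the values run out) and `FileHelper.appendFile`. The following is a hedged guess at what such helpers could look like, inferred only from how they are called above. The class name `ReduceReaderSketch`, the UTF-8 encoding, the `Closeable` support, and the `roundTrip` demo method are assumptions, not the author's actual code.

```java
import java.io.BufferedReader;
import java.io.Closeable;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for ReduceReader: streams the values of one key
// line by line, so a large value file never has to fit in memory.
public class ReduceReaderSketch implements Closeable {
    private final BufferedReader reader;

    public ReduceReaderSketch(File file) throws IOException {
        reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8));
    }

    // Returns the next value, or null once all values have been read.
    public String next() throws IOException {
        return reader.readLine();
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }

    // Hypothetical stand-in for FileHelper.appendFile: opens the file in
    // append mode, writes the content, and closes the file again.
    public static void appendFile(File file, String content) throws IOException {
        try (Writer writer = new OutputStreamWriter(
                new FileOutputStream(file, true), StandardCharsets.UTF_8)) {
            writer.write(content);
        }
    }

    // Demo: append two values to a temp file, then read them back.
    public static List<String> roundTrip() throws IOException {
        File file = File.createTempFile("reduce-sketch", ".txt");
        file.deleteOnExit();
        appendFile(file, "v1\r\n");
        appendFile(file, "v2\r\n");
        List<String> values = new ArrayList<>();
        try (ReduceReaderSketch in = new ReduceReaderSketch(file)) {
            String line;
            while ((line = in.next()) != null) {
                values.add(line);
            }
        }
        return values;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip()); // prints [v1, v2]
    }
}
```

Opening and closing the file on every append keeps the helper simple, but it is slow when map() emits many values; a cache of open writers per key id would be one way to reduce that cost.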