Implementing MapReduce in Java on Top of the File System (and MySQL)

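Why I wrote this code:

I know MapReduce, but so far I have only used it on AWS EMR. I have set up a pseudo-distributed cluster myself, but it felt hard to operate and maintain.
MySQL is the only database I know reasonably well (I wanted to use MongoDB, but I am not comfortable with it).
The data volume is not very large, at least for my purposes.
I want to avoid things going wrong, and in that respect the file system can be trusted.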

The design is as follows:

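init phase: add the required input files to a list file, input_file_list.txt.
Map phase: read every line of every file listed in input_file_list.txt and map it to a key-value pair.

Because a key may contain special characters, it cannot be used directly as a file name. MySQL therefore stores an id-to-key mapping, and the values for each key are appended to a temporary file named after that id.
Reduce phase: for each key, read the corresponding file and produce a list of name-value maps. Each map corresponds to one JSON object, such as { "name": "zifeiy", "age": 88 }. All JSON objects are stored, one per line, in a reduce result file (reduce.txt in the code below).
Result phase: parse the reduce result file and generate the final result as a CSV file (when the reduce result is larger than 1 MB) or an Excel file (otherwise).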
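
The phases above can be sketched in a few dozen lines. This is only a minimal illustration under stated assumptions, not the code from this post: an in-memory map stands in for the MySQL id-to-key table, word counting stands in for the user-supplied map and reduce steps, and the class name FileBackedMapReduceSketch with its methods emit and wordCount is hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: shuffle values into one file per key id, then reduce file by file.
final class FileBackedMapReduceSketch {

    // Stand-in for the MySQL table (assumption: the key set fits in memory).
    private final Map<String, Integer> keyIds = new LinkedHashMap<>();
    private final Path tmpDir;

    FileBackedMapReduceSketch(Path workDir) throws IOException {
        this.tmpDir = Files.createDirectories(workDir.resolve("tmp"));
    }

    // Map side: append the value to the file named after the key's id,
    // so keys such as "a/b" never have to be valid file names.
    void emit(String key, String value) throws IOException {
        Integer id = keyIds.get(key);
        if (id == null) {
            id = keyIds.size() + 1;
            keyIds.put(key, id);
        }
        Files.write(tmpDir.resolve(id + ".txt"),
                (value + "\n").getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    static Map<String, Integer> wordCount(List<String> lines, Path workDir) throws IOException {
        FileBackedMapReduceSketch job = new FileBackedMapReduceSketch(workDir);

        // Map phase: every word becomes the pair (word, "1").
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    job.emit(word, "1");
                }
            }
        }

        // Reduce phase: read each key's file and sum its values.
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : job.keyIds.entrySet()) {
            Path keyFile = job.tmpDir.resolve(entry.getValue() + ".txt");
            int sum = 0;
            for (String value : Files.readAllLines(keyFile, StandardCharsets.UTF_8)) {
                sum += Integer.parseInt(value);
            }
            result.put(entry.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path workDir = Files.createTempDirectory("mr-sketch");
        // prints {a/b=2, c=2, 'd'=1}
        System.out.println(wordCount(Arrays.asList("a/b a/b c", "c 'd'"), workDir));
    }
}
```

Keeping the id-to-key mapping in a database rather than in memory, as the design above does, means the mapping does not have to fit in the JVM heap; the sketch trades that away for brevity.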

Main code:

package com.zifeiy.snowflake.tools.mapreduce.v1;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import com.google.gson.Gson;
import com.zifeiy.snowflake.assist.CsvOneLineParser;
import com.zifeiy.snowflake.assist.FileHelper;

import jxl.Workbook;
import jxl.write.Label;
import jxl.write.WritableSheet;
import jxl.write.WritableWorkbook;

public abstract class MapReduceBaseVersion1 {

	// JDBC connection settings
	private static final String APPENDED_DB_INFO = "?useUnicode=true&characterEncoding=UTF8"
			+ "&rewriteBatchedStatements=true"
			+ "&useLegacyDatetimeCode=false"
			+ "&serverTimezone=Asia/Shanghai"
			+ "&useSSL=false";
	private static final String classname = "com.mysql.cj.jdbc.Driver";
	private static final String url = "jdbc:mysql://localhost:3306/snowflake" + APPENDED_DB_INFO;
	private static final String username = "root";
	private static final String password = "password";

	// every task gets its own directory <taskRootPath>/<taskId>
	public static final String taskRootPath = "D:\\snowflake\\task";

	private Connection connection = null;
	private File inputListFile = null;		// input_file_list.txt
	private File reduceResultFile = null;	// reduce.txt, one JSON object per line
	private File resultFile = null;			// result.csv or result.xls
	private int taskId;

	// called from init(): register one input file by appending its path to input_file_list.txt
	public void addInputPath(File file) throws IOException {
		FileHelper.appendFile(inputListFile, file.getAbsolutePath() + "\r\n");
	}

	// called from map(): append value to the temporary file that belongs to key
	public void setKeyValuePair(String key, String value) throws Exception {
		String table = "tmp" + taskId;
		int id = findKeyId(table, key);
		if (id == -1) {
			// first time this key is seen: insert it, then read back its auto-increment id
			try (PreparedStatement insert = connection.prepareStatement("insert into " + table + " (kname) values (?)")) {
				insert.setString(1, key);
				insert.executeUpdate();
			}
			id = findKeyId(table, key);
		}
		if (id == -1) throw new Exception("set key value pair failed: key = " + key + ", value = " + value);
		File tmpFile = new File(taskRootPath + File.separator + taskId + File.separator + "tmp" + File.separator + id + ".txt");
		if (!tmpFile.exists()) {
			tmpFile.createNewFile();
		}
		FileHelper.appendFile(tmpFile, value + "\r\n");
	}

	// returns the id stored for key, or -1 if the key is not in the table yet;
	// the key is bound as a parameter, so quotes and backslashes in it are safe
	private int findKeyId(String table, String key) throws Exception {
		try (PreparedStatement select = connection.prepareStatement("select id from " + table + " where kname = ?")) {
			select.setString(1, key);
			try (ResultSet resultSet = select.executeQuery()) {
				return resultSet.next() ? resultSet.getInt(1) : -1;
			}
		}
	}

	// called from reduce(): append each map to reduce.txt as one JSON object per line
	public void addParamList(List<Map<String, String>> paramList) throws Exception {
		StringBuilder content = new StringBuilder();
		Gson gson = new Gson();
		for (Map<String, String> params : paramList) {
			content.append(gson.toJson(params)).append("\r\n");
		}
		FileHelper.appendFile(reduceResultFile, content.toString());
	}

	// called from generate(): columns are the JSON field names to export, nameColumns the header labels
	public void generateFile(String[] columns, String[] nameColumns) throws Exception {
		if (reduceResultFile == null || !reduceResultFile.exists()) {
			throw new Exception("[mapreduce.v1] in generateFile function: reduceResultFile does not exist!");
		}
		Gson gson = new Gson();
		if (reduceResultFile.length() > 1 * 1024 * 1024) {	// larger than 1 MB: export as CSV
			resultFile = new File(taskRootPath + File.separator + taskId + File.separator + "result.csv");
			BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(reduceResultFile), "UTF-8"));
			OutputStreamWriter osw = new OutputStreamWriter(new FileOutputStream(resultFile), "UTF-8");
			// header row; embedded double quotes are doubled so the CSV stays parseable
			String content = "";
			for (int i = 0; i < nameColumns.length; i++) {
				if (i > 0) content += ",";
				content += '"' + nameColumns[i].replace("\"", "\"\"") + '"';
			}
			osw.write(content + "\r\n");
			// one CSV row per JSON line
			String line = null;
			while ((line = br.readLine()) != null) {
				content = "";
				Map<String, String> map = gson.fromJson(line, Map.class);
				if (map == null) { throw new Exception("map is null by parsing line: " + line); }
				for (int i = 0; i < columns.length; i++) {
					if (i > 0) content += ",";
					String v = map.get(columns[i]);
					if (v != null) {
						content += '"' + v.replace("\"", "\"\"") + '"';
					}
				}
				osw.write(content + "\r\n");
			}
			br.close();
			osw.flush();
			osw.close();
		} else {	// 1 MB or smaller: export as an Excel file
			resultFile = new File(taskRootPath + File.separator + taskId + File.separator + "result.xls");
			WritableWorkbook workbook = Workbook.createWorkbook(resultFile);
			WritableSheet sheet = workbook.createSheet("sheet1", 0);
			BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(reduceResultFile), "UTF-8"));
			// header row
			for (int i = 0; i < nameColumns.length; i++) {
				sheet.addCell(new Label(i, 0, nameColumns[i]));
			}
			// one sheet row per JSON line; missing fields become empty cells
			int rowId = 1;
			String line = null;
			while ((line = br.readLine()) != null) {
				Map<String, String> map = gson.fromJson(line, Map.class);
				if (map == null) { throw new Exception("map is null by parsing line: " + line); }
				for (int i = 0; i < columns.length; i++) {
					String v = map.get(columns[i]);
					sheet.addCell(new Label(i, rowId, v != null ? v : ""));
				}
				rowId++;
			}
			br.close();
			workbook.write();
			workbook.close();
		}
	}

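	// phase 1: register the input files with addInputPath
	public abstract void init() throws Exception;

	// phase 2: turn one input line into key-value pairs with setKeyValuePair
	public abstract void map(String line) throws Exception;

	// phase 3: process all values of one key and store the output with addParamList
	public abstract void reduce(String key, ReduceReader reduceReader) throws Exception;

	// phase 4: produce the result file, typically with generateFile
	public abstract void generate() throws Exception;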

	// runs all four phases; returns the absolute path of the result file, or null on failure
	public String mapreduce() {
		try {
			Class.forName(classname);
			connection = DriverManager.getConnection(url, username, password);

			// generate taskId from the auto-increment key of a new row in the task table
			PreparedStatement preparedStatement = connection.prepareStatement("insert into task () values ()", Statement.RETURN_GENERATED_KEYS);
			preparedStatement.executeUpdate();
			ResultSet resultSet = preparedStatement.getGeneratedKeys();
			if (resultSet.next()) {
				taskId = resultSet.getInt(1);
			} else {
				throw new Exception("[mapreduce.v1] Exception: can not generate taskId");
			}
			preparedStatement.close();

			// create the task directory and its tmp subdirectory
			String taskPath = taskRootPath + File.separator + taskId;
			File taskPathDir = new File(taskPath);
			if (taskPathDir.exists()) {
				throw new Exception("[mapreduce.v1] Exception: task directory already exists");
			}
			taskPathDir.mkdirs();
			String tmpDirPath = taskPath + File.separator + "tmp";
			File tmpDir = new File(tmpDirPath);
			tmpDir.mkdirs();
			this.inputListFile = new File(taskPath + File.separator + "input_file_list.txt");
			inputListFile.createNewFile();

			// phase 1: init()
			// init() is expected to call addInputPath for every input file
			init();

			// phase 2: map(line), called once for each line of each input file
			// db prepare: temporary table holding the id-to-key mapping
			Statement statement = connection.createStatement();
			statement.execute("create temporary table tmp" + taskId + " ( id int not null auto_increment primary key, kname varchar(200) )");
			// file content prepare
			BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(inputListFile), "UTF-8"));
			String inputFilename = null;
			while ((inputFilename = br.readLine()) != null) {
				File inputFile = new File(inputFilename);
				if (!inputFile.exists()) {
					throw new Exception("[mapreduce.v1] Exception: input file " + inputFilename + " does not exist!");
				}
				// note: input files are read as GBK
				BufferedReader br2 = new BufferedReader(new InputStreamReader(new FileInputStream(inputFile), "GBK"));
				String line = null;
				while ((line = br2.readLine()) != null) {
					map(line);
				}
				br2.close();
			}
			br.close();

			// phase 3: reduce(key, reduceReader), called once per key
			reduceResultFile = new File(taskPath + File.separator + "reduce.txt");
			if (reduceResultFile.exists()) {
				throw new Exception("[mapreduce.v1] reduce file already exists!");
			}
			reduceResultFile.createNewFile();
			resultSet = statement.executeQuery("select * from tmp" + taskId);
			while (resultSet.next()) {
				int id = resultSet.getInt(1);
				String key = resultSet.getString(2);
				File reduceFile = new File(tmpDirPath + File.separator + id + ".txt");
				if (!reduceFile.exists()) {
					throw new Exception("[mapreduce.v1] Exception: reduce file " + reduceFile.getName() + " does not exist!");
				}
				ReduceReader reduceReader = new ReduceReader(reduceFile);
				reduce(key, reduceReader);
			}

			// phase 4: generate the result file
			generate();
		} catch (Exception e) {
			e.printStackTrace();
		} finally {
			// always release the connection; closing it also drops the temporary table
			if (connection != null) {
				try { connection.close(); } catch (Exception ignored) { }
			}
		}
		if (resultFile == null) return null;
		else return resultFile.getAbsolutePath();
	}

	// main for test
	public static void main(String[] args) {
		MapReduceBaseVersion1 mapReduceBaseVersion1 = new MapReduceBaseVersion1() {

			@Override
			public void reduce(String key, ReduceReader reduceReader) throws Exception {
				// parse every value as one CSV row and store its fields under the names "1", "2", ...
				List<Map<String, String>> paramList = new ArrayList<Map<String, String>>();
				String line;
				while ((line = reduceReader.next()) != null) {
					List<String> rowList = CsvOneLineParser.parseLine(line);
					Map<String, String> tmpMap = new HashMap<String, String>();
					int idx = 0;
					for (String s : rowList) {
						idx++;
						tmpMap.put("" + idx, s);
					}
					paramList.add(tmpMap);
				}
				addParamList(paramList);
			}

			@Override
			public void map(String line) throws Exception {
				// key = the 2nd and 3rd characters of the line, value = the whole line
				setKeyValuePair(line.substring(1, 3), line);
			}

			@Override
			public void init() throws Exception {
				addInputPath(new File("D:\\test\\test.del"));
			}

			@Override
			public void generate() throws Exception {
				// export fields "1" to "6"; the header labels are the Chinese numerals one to six
				generateFile(new String[] { "1", "2", "3", "4", "5", "6" }, new String[] { "一", "二", "三", "四", "五", "六" });
			}
		};
		System.out.println(mapReduceBaseVersion1.mapreduce());
	}

}
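
The listing relies on two helpers that are not shown in this post: FileHelper.appendFile and ReduceReader. Their real implementations may differ. The following is only a guess at their minimal behavior, inferred from how they are called above (appendFile appends text to a file; ReduceReader.next() returns one value per call and null at the end). The names HelperSketches and LineReader are hypothetical, and UTF-8 is an assumed encoding.

```java
import java.io.BufferedReader;
import java.io.Closeable;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-ins for the helpers used by MapReduceBaseVersion1.
final class HelperSketches {

    // Stand-in for FileHelper.appendFile: append text to a file as UTF-8 (assumed encoding).
    static void appendFile(File file, String content) throws IOException {
        // The second FileOutputStream argument (true) opens the file in append mode.
        try (Writer writer = new OutputStreamWriter(new FileOutputStream(file, true), StandardCharsets.UTF_8)) {
            writer.write(content);
        }
    }

    // Stand-in for ReduceReader: streams the values of one key, one line at a time,
    // so the values of a key never have to be held in memory all at once.
    static final class LineReader implements Closeable {
        private final BufferedReader reader;

        LineReader(File file) throws IOException {
            this.reader = new BufferedReader(
                    new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8));
        }

        // Returns the next value, or null when the file is exhausted.
        String next() throws IOException {
            return reader.readLine();
        }

        @Override
        public void close() throws IOException {
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("reduce-sketch", ".txt");
        file.deleteOnExit();
        appendFile(file, "first\r\n");
        appendFile(file, "second\r\n");
        try (LineReader reader = new LineReader(file)) {
            String value;
            while ((value = reader.next()) != null) {
                System.out.println(value);
            }
        }
    }
}
```

Judging from the statement insert into task () values (), the listing also expects a task table with an auto-increment primary key to exist in the snowflake database before mapreduce() is called.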
