Scattered notes on using the components of a Hadoop platform
>20161011: Data import study
0. Sqoop prints a warning that Accumulo needs to be installed;
1. Download the Microsoft SQL Server JDBC driver (downloaded with IE) and put the sqljdbc42 jar into Sqoop's lib directory. Note that all automatically installed Hadoop-related software is placed under /usr/hdp.
2.sqoop list-databases --connect jdbc:sqlserver://172.4.25.98 --username sa --password sa12345
3.sqoop list-tables --connect 'jdbc:sqlserver://172.4.25.98;database=wind_data;' --username sa --password sa12345
4. sqoop import --query 'select datatime,lastvalue from WT0001_R010 where $CONDITIONS' --connect 'jdbc:sqlserver://172.4.25.98;database=wind_data' --username sa --password sa12345 --hbase-table 'myy' --hbase-create-table --hbase-row-key datatime --split-by datatime -m 2 --column-family datatime
>20161013: MapReduce framework study
1. Starting Spark from Ambari Server actually only starts the History Server; to bring up the standalone Spark master you still need to run start-master.sh;
2. The MapReduce job history web UI port is 19888; the Spark job history UI here is on 8081 (8080 + 1);
3. Running a "distributed" program from the IDE is really local mode; to run it truly distributed the code has to be submitted to the cluster's master. Hence the approach of packaging a jar first and then, inside the driver program, calling a function that registers that jar. As I understand it, the main value of this approach is that the jar does not have to contain the whole program: the driver launches the distributed work automatically and then keeps executing the rest of its own code (see the sketch below);
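A minimal sketch of that idea, assuming a pre-built application jar at a hypothetical path and a hypothetical master hostname; SparkConf.setJars ships the jar to the executors while the driver keeps running its own code:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverWithJar {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("submit-from-driver")
      .setMaster("spark://master-host:7077")           // cluster master, not "local"; hostname is illustrative
      .setJars(Seq("/path/to/my-distributed-job.jar")) // jar produced by your build; path is illustrative
    val sc = new SparkContext(conf)

    // Distributed part: runs on the cluster using the classes shipped in the jar.
    val count = sc.textFile("hdfs:///data/input").flatMap(_.split("\\s+")).count()

    // Anything after this point is ordinary driver-side code.
    println(s"word count = $count")
    sc.stop()
  }
}
```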
4. yarn logs -applicationId <application ID> retrieves the aggregated logs of an application;
5. Error when running in cluster mode: "SparkContext did not initialize after waiting for 100000 ms. Please check earlier log output for errors. Failing the application." Before that there was also a "main class not found" error. In client mode the dependencies are found (in the Hadoop_jar folder); in cluster mode they are not;
6. <***org.apache.hadoop.mapred.Mapper***>:
Maps input key/value pairs to a set of intermediate key/value pairs.
Maps are the individual tasks which transform input records into intermediate records. The transformed intermediate records need not be of the same type as the input records. /*** Note the difference from the map of functional languages: a functional map applies the same operation to every element of a list, so the number of records never changes (only their type may). The next sentence shows that a MapReduce map can change the number of records, which means many MapReduce diagrams found online are wrong. ***/ A given input pair may map to zero or many output pairs.
The Hadoop Map-Reduce framework spawns one map task for each InputSplit generated by the InputFormat for the job. /*** This is why map can change the number of records: the input is indexed by the InputFormat rather than being a plain list. ***/ Mapper implementations can access the JobConf for the job via JobConfigurable.configure(JobConf) and initialize themselves. Similarly they can use the Closeable.close() method for de-initialization.
The framework then calls map(Object, Object, OutputCollector, Reporter) for each key/value pair in the InputSplit for that task.
All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to a Reducer to determine the final output. Users can control the grouping by specifying a Comparator via JobConf.setOutputKeyComparatorClass(Class).
The grouped Mapper outputs are partitioned per Reducer. /*** Once the intermediate key/value pairs are produced, records with the same key are grouped together automatically; this is the key step that hides the distributed processing. ***/ Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.
Users can optionally specify a combiner, via JobConf.setCombinerClass(Class), to perform local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.
The intermediate, grouped outputs are always stored in SequenceFiles. Applications can specify if and how the intermediate outputs are to be compressed and which CompressionCodecs are to be used via the JobConf.
If the job has zero reduces then the output of the Mapper is directly written to the FileSystem without grouping by keys. /*** Note this default behaviour, which avoids unnecessary work. ***/
<***org.apache.hadoop.mapreduce.Mapper***>:
Unlike the Mapper interface above, this Mapper is a class: The framework first calls setup(org.apache.hadoop.mapreduce.Mapper.Context), followed by map(Object, Object, org.apache.hadoop.mapreduce.Mapper.Context) for each key/value pair in the InputSplit. Finally cleanup(org.apache.hadoop.mapreduce.Mapper.Context) is called. A minimal example is sketched below.
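A minimal sketch of a new-API Mapper written in Scala (the tokenize-into-(word, 1)-pairs logic is purely illustrative, not anything from the notes above); it shows the setup/map/cleanup lifecycle and that one input record may emit zero or many intermediate pairs:

```scala
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  type Ctx = Mapper[LongWritable, Text, Text, IntWritable]#Context

  private val one  = new IntWritable(1)
  private val word = new Text()

  // Called once per map task, before any map() call.
  override def setup(context: Ctx): Unit = {}

  // Called once per input key/value pair; may emit zero or many output pairs,
  // which is why this map is not the map of functional languages.
  override def map(key: LongWritable, value: Text, context: Ctx): Unit = {
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { token =>
      word.set(token)
      context.write(word, one)
    }
  }

  // Called once per map task, after the last map() call.
  override def cleanup(context: Ctx): Unit = {}
}
```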
7. Before the map output is written to the buffer it goes through a partition step. MapReduce provides the Partitioner interface, whose job is to decide, from the key (or value) and the number of reducers, which reduce task should receive a given output pair. By default the key is hashed and then taken modulo the number of reduce tasks, which simply spreads load evenly across reducers; if you need different behaviour you can implement your own Partitioner and set it on the job (see the sketch after this list). Note three things:
a: identical keys always hash to the same value;
b: taking the hash modulo the number of reduce tasks yields a value in 0 .. (number of reduce tasks - 1), which maps exactly onto the reduce tasks;
c: there is no guarantee that every reduce task handles the same number of keys;
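A sketch in Scala of plugging in a custom Partitioner (the route-by-first-character rule is made up purely for illustration); the comment restates the default hash-then-modulo rule described in point 7:

```scala
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Partitioner

// The modulo maps every key to a value in [0, numPartitions - 1], one per reduce task,
// but nothing guarantees that each reducer receives the same number of distinct keys.
class FirstLetterPartitioner extends Partitioner[Text, IntWritable] {
  override def getPartition(key: Text, value: IntWritable, numPartitions: Int): Int = {
    // Default behaviour (HashPartitioner) would be:
    //   (key.hashCode & Integer.MAX_VALUE) % numPartitions
    // Here we route by the first character instead, just as an illustration.
    (key.toString.headOption.getOrElse(' ').toInt & Integer.MAX_VALUE) % numPartitions
  }
}
// Register it on the job with job.setPartitionerClass(classOf[FirstLetterPartitioner]).
```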
8. The first step of MapReduce, the split, has no direct relationship to how HDFS splits files into blocks; the Mapper reads the HDFS file through an InputFormat, which then performs its own splitting. Each input type implementation knows how to split itself into meaningful ranges for processing as separate map tasks (e.g. text mode's range splitting ensures that range splits occur only at line boundaries).
9. Spark improves efficiency by sharing intermediate data, avoiding writing the intermediate key/value pairs to the DFS. Take iteration as an example: every step of an iteration produces an intermediate result that would otherwise go to the DFS. When I first met this idea I wondered: the intermediate result is only written to the DFS when the map function returns, so why not put the whole iteration inside a single map function and avoid the DFS round trips?
In fact:
1. Consider this definition of iteration: a computation feeds its own output back in as its next input; typically, a map function takes its own output as the input of the next map. Under that definition the map function must return and be re-executed before the process finishes;
2. /**********/ Is the definition above actually semantically right in practice? Couldn't a recursive call inside the map function also implement iteration? And how does all of this relate to the in-memory abstraction?
3. The problem is probably not the point raised in 1 and 2. What really needs an in-memory abstraction is data sharing between different components; iteration is only one example. MapReduce's real shortcoming is that it gives the programmer no such in-memory abstraction for sharing map results;
4. The real problem, close to what 1 and 2 describe, is that an algorithm may need to reuse the result of a map repeatedly, or to map over the result of a reduce, so the result needs to be kept in memory. For example:
val points = spark.textFile(...).map(parsePoint).persist() // the map result is reused, so persist it
var w = // random initial vector
for (i <- 1 to ITERATIONS) {
  val gradient = points.map { p =>
    p.x * (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y
  }.reduce((a, b) => a + b)
  w -= gradient // the gradient only exists after the reduce
}
10. The split is a logical split of the inputs; the input files are not physically split into chunks. For example, a split could be an <input-file-path, start, offset> tuple. The InputFormat also creates the RecordReader used to read the InputSplit. A sketch of inspecting these logical splits follows.
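A small sketch (assuming the new mapreduce API and that the input path is passed as the first command-line argument) that simply prints the logical splits TextInputFormat would hand to the map tasks:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}
import scala.collection.JavaConverters._

object SplitInspect {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance()
    FileInputFormat.addInputPath(job, new Path(args(0)))

    // Each InputSplit is logical: for text files it is roughly (file, start offset, length),
    // and the record reader later snaps the ranges to line boundaries.
    new TextInputFormat().getSplits(job).asScala.foreach { split =>
      println(s"$split length=${split.getLength}")
    }
  }
}
```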
>20161103: Ambari platform administration study
1. After an agent has lost heartbeats, the Ambari agent must be restarted (ambari-agent restart) to recover;
>20161103: Spark-HBase programming study
1.
Master URL | Meaning
---|---
local | Run Spark locally with one worker thread (i.e. no parallelism at all).
local[K] | Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
local[*] | Run Spark locally with as many worker threads as logical cores on your machine.
spark://HOST:PORT | Connect to the given Spark standalone cluster master. The port must be whichever one your master is configured to use, which is 7077 by default.
mesos://HOST:PORT | Connect to the given Mesos cluster. The port must be whichever one your cluster is configured to use, which is 5050 by default. Or, for a Mesos cluster using ZooKeeper, use mesos://zk://... . To submit with --deploy-mode cluster, the HOST:PORT should be configured to connect to the MesosClusterDispatcher.
yarn | Connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.
2. In practice, when running on a cluster, you will not want to hardcode master in the program, but rather launch the application with spark-submit and receive it there. However, for local testing and unit tests, you can pass "local" to run Spark in-process.
In other words, the master can be hardcoded in a Spark program, but for easier configuration the application is usually launched from the shell, so the master is passed in as a parameter and nothing needs to be set in the code (see the sketch below);
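A sketch of that convention (the class and jar names are illustrative): the code sets no master, and spark-submit supplies it:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object NoHardcodedMaster {
  def main(args: Array[String]): Unit = {
    // No setMaster() here: the master URL comes from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("no-hardcoded-master"))
    println(sc.textFile("hdfs:///data/input").count())
    sc.stop()
  }
}

// Launched, for example, with:
//   spark-submit --master yarn --deploy-mode client \
//     --class NoHardcodedMaster my-app.jar
```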
3. Conceptual dependency chain for understanding: Spark --> RDD --> MapReduce --> Hadoop InputFormat;
4. When data is delivered to map, the input split is handed to the InputFormat; the InputFormat calls getRecordReader() to produce a RecordReader, and the RecordReader in turn uses createKey() and createValue() to build the <key, value> pairs that map can process. In short, the InputFormat is what generates the <key, value> pairs consumed by map.
5. A Result is backed by an array of Cell objects, each representing an HBase cell defined by the row, family, qualifier, timestamp, and value. Note that the splits produced by the stock TableInputFormat also return values of type Result, so calling newAPIHadoopRDD through Spark's Scala API yields an RDD[(ImmutableBytesWritable, Result)] (see the sketch below).
In particular, a Result represents a single row result of a Get or Scan query.
To get a complete mapping of all cells in the Result, which can include multiple families and multiple versions, use getMap().
To get a mapping of each family to its columns (qualifiers and values), including only the latest version of each, use getNoVersionMap().
To get a mapping of qualifiers to latest values for an individual family use getFamilyMap(byte[]).
To get the latest value for a specific family and qualifier use getValue(byte[], byte[]).
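A minimal sketch of reading HBase from Spark via newAPIHadoopRDD, assuming the table 'myy' with column family 'datatime' created by the sqoop import above; the qualifier name 'lastvalue' is my assumption and purely illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

object HBaseReadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-read"))
    val hconf = HBaseConfiguration.create()
    hconf.set(TableInputFormat.INPUT_TABLE, "myy") // table created by the sqoop import above

    // Each element is (row key, Result): TableInputFormat's splits yield Result values.
    val rdd = sc.newAPIHadoopRDD(hconf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])

    rdd.map { case (rowKey, result) =>
      // getValue returns the latest cell for the given family/qualifier, or null if absent.
      val v = result.getValue(Bytes.toBytes("datatime"), Bytes.toBytes("lastvalue"))
      (Bytes.toString(rowKey.get()), Option(v).map(Bytes.toString).orNull)
    }.take(10).foreach(println)

    sc.stop()
  }
}
```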
6. HBase guarantees atomicity (i.e. thread safety) only at the row level: while a writer is operating on a row, readers of that row wait until the row lock is released; for concurrent operations spanning multiple rows no atomicity is guaranteed. One API built on this row-level atomicity is sketched below.
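A sketch of checkAndPut, which relies on exactly this row-level atomicity (it compares and writes within one row as a single atomic operation; across rows there is no such guarantee). The table, row key, and column names are illustrative, and the HBase 1.x client API is assumed:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

object RowAtomicityDemo {
  def main(args: Array[String]): Unit = {
    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("myy")) // names are illustrative
    try {
      val row = Bytes.toBytes("2016-10-11 00:00:00")
      val put = new Put(row)
        .addColumn(Bytes.toBytes("datatime"), Bytes.toBytes("lastvalue"), Bytes.toBytes("42.0"))
      // Write only if the current value of datatime:lastvalue is still "41.0"; atomic for this row only.
      val applied = table.checkAndPut(row, Bytes.toBytes("datatime"), Bytes.toBytes("lastvalue"),
        Bytes.toBytes("41.0"), put)
      println(s"checkAndPut applied: $applied")
    } finally {
      table.close(); conn.close()
    }
  }
}
```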