Scattered notes on using the components of a Hadoop platform
>20161011: Data import study
0. Sqoop prints a warning that Accumulo needs to be installed;
1. Download the Microsoft SQL Server JDBC driver (downloaded with IE) and put the sqljdbc42 jar into Sqoop's lib directory. Note that all automatically installed Hadoop-related software is placed under /usr/hdp.
2.sqoop list-databases --connect jdbc:sqlserver://172.4.25.98 --username sa --password sa12345
3.sqoop list-tables --connect 'jdbc:sqlserver://172.4.25.98;database=wind_data;' --username sa --password sa12345
4. sqoop import --query 'select datatime,lastvalue from WT0001_R010 where $CONDITIONS' --connect 'jdbc:sqlserver://172.4.25.98;database=wind_data' --username sa --password sa12345 --hbase-table 'myy' --hbase-create-table --hbase-row-key datatime --split-by datatime -m 2 --column-family datatime
>20161013: MapReduce framework study
1. Starting Spark from Ambari Server actually only starts the History Server; to bring up the standalone Spark master you still need to run start-master.sh;
2. The MapReduce job history web UI port is 19888; the Spark job history UI here is on 8081 (8080 + 1);
3. Running a "distributed" program from the IDE is really local mode; to run it truly distributed the code has to be submitted to the cluster's master. Hence the approach of packaging a jar first and then, inside the driver program, calling a function that registers that jar. As I understand it, the main value of this approach is that the jar does not have to contain the whole program: the driver launches the distributed work automatically and then keeps executing the rest of its own code (see the sketch below);
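A minimal sketch of that idea, assuming a pre-built application jar at a hypothetical path and a hypothetical master hostname; SparkConf.setJars ships the jar to the executors while the driver keeps running its own code:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverWithJar {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("submit-from-driver")
      .setMaster("spark://master-host:7077")           // cluster master, not "local"; hostname is illustrative
      .setJars(Seq("/path/to/my-distributed-job.jar")) // jar produced by your build; path is illustrative
    val sc = new SparkContext(conf)

    // Distributed part: runs on the cluster using the classes shipped in the jar.
    val count = sc.textFile("hdfs:///data/input").flatMap(_.split("\\s+")).count()

    // Anything after this point is ordinary driver-side code.
    println(s"word count = $count")
    sc.stop()
  }
}
```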
4. yarn logs -applicationId <application ID> retrieves the aggregated logs of an application;
5. Error when running in cluster mode: "SparkContext did not initialize after waiting for 100000 ms. Please check earlier log output for errors. Failing the application." Before that there was also a "main class not found" error. In client mode the dependencies are found (in the Hadoop_jar folder); in cluster mode they are not;
6. <***org.apache.hadoop.mapred.Mapper***>:
Maps input key/value pairs to a set of intermediate key/value pairs.
Maps are the individual tasks which transform input records into intermediate records. The transformed intermediate records need not be of the same type as the input records. /*** Note the difference from the map of functional languages: a functional map applies the same operation to every element of a list, so the number of records never changes (only their type may). The next sentence shows that a MapReduce map can change the number of records, which means many MapReduce diagrams found online are wrong. ***/ A given input pair may map to zero or many output pairs.
The Hadoop Map-Reduce framework spawns one map task for each InputSplit generated by the InputFormat for the job. /*** This is why map can change the number of records: the input is indexed by the InputFormat rather than being a plain list. ***/ Mapper implementations can access the JobConf for the job via JobConfigurable.configure(JobConf) and initialize themselves. Similarly they can use the Closeable.close() method for de-initialization.
The framework then calls map(Object, Object, OutputCollector, Reporter) for each key/value pair in the InputSplit for that task.
All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to a Reducer to determine the final output. Users can control the grouping by specifying a Comparator via JobConf.setOutputKeyComparatorClass(Class).
The grouped Mapper outputs are partitioned per Reducer. /*** Once the intermediate key/value pairs are produced, records with the same key are grouped together automatically; this is the key step that hides the distributed processing. ***/ Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.
Users can optionally specify a combiner, via JobConf.setCombinerClass(Class), to perform local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.
The intermediate, grouped outputs are always stored in SequenceFiles. Applications can specify if and how the intermediate outputs are to be compressed and which CompressionCodecs are to be used via the JobConf.
If the job has zero reduces then the output of the Mapper is directly written to the FileSystem without grouping by keys. /*** Note this default behaviour, which avoids unnecessary work. ***/
<***org.apache.hadoop.mapreduce.Mapper***>:
Unlike the Mapper interface above, this Mapper is a class: The framework first calls setup(org.apache.hadoop.mapreduce.Mapper.Context), followed by map(Object, Object, org.apache.hadoop.mapreduce.Mapper.Context) for each key/value pair in the InputSplit. Finally cleanup(org.apache.hadoop.mapreduce.Mapper.Context) is called. A minimal example is sketched below.
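A minimal sketch of a new-API Mapper written in Scala (the tokenize-into-(word, 1)-pairs logic is purely illustrative, not anything from the notes above); it shows the setup/map/cleanup lifecycle and that one input record may emit zero or many intermediate pairs:

```scala
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  type Ctx = Mapper[LongWritable, Text, Text, IntWritable]#Context

  private val one  = new IntWritable(1)
  private val word = new Text()

  // Called once per map task, before any map() call.
  override def setup(context: Ctx): Unit = {}

  // Called once per input key/value pair; may emit zero or many output pairs,
  // which is why this map is not the map of functional languages.
  override def map(key: LongWritable, value: Text, context: Ctx): Unit = {
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { token =>
      word.set(token)
      context.write(word, one)
    }
  }

  // Called once per map task, after the last map() call.
  override def cleanup(context: Ctx): Unit = {}
}
```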
7. Before the map output is written to the buffer it goes through a partition step. MapReduce provides the Partitioner interface, whose job is to decide, from the key (or value) and the number of reducers, which reduce task should receive a given output pair. By default the key is hashed and then taken modulo the number of reduce tasks, which simply spreads load evenly across reducers; if you need different behaviour you can implement your own Partitioner and set it on the job (see the sketch after this list). Note three things:
a: identical keys always hash to the same value;
b: taking the hash modulo the number of reduce tasks yields a value in 0 .. (number of reduce tasks - 1), which maps exactly onto the reduce tasks;
c: there is no guarantee that every reduce task handles the same number of keys;
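A sketch in Scala of plugging in a custom Partitioner (the route-by-first-character rule is made up purely for illustration); the comment restates the default hash-then-modulo rule described in point 7:

```scala
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Partitioner

// The modulo maps every key to a value in [0, numPartitions - 1], one per reduce task,
// but nothing guarantees that each reducer receives the same number of distinct keys.
class FirstLetterPartitioner extends Partitioner[Text, IntWritable] {
  override def getPartition(key: Text, value: IntWritable, numPartitions: Int): Int = {
    // Default behaviour (HashPartitioner) would be:
    //   (key.hashCode & Integer.MAX_VALUE) % numPartitions
    // Here we route by the first character instead, just as an illustration.
    (key.toString.headOption.getOrElse(' ').toInt & Integer.MAX_VALUE) % numPartitions
  }
}
// Register it on the job with job.setPartitionerClass(classOf[FirstLetterPartitioner]).
```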
8. The first step of MapReduce, the split, has no direct relationship to how HDFS splits files into blocks; the Mapper reads the HDFS file through an InputFormat, which then performs its own splitting. Each input type implementation knows how to split itself into meaningful ranges for processing as separate map tasks (e.g. text mode's range splitting ensures that range splits occur only at line boundaries).
9. Spark improves efficiency by sharing intermediate data, avoiding writing the intermediate key/value pairs to the DFS. Take iteration as an example: every step of an iteration produces an intermediate result that would otherwise go to the DFS. When I first met this idea I wondered: the intermediate result is only written to the DFS when the map function returns, so why not put the whole iteration inside a single map function and avoid the DFS round trips?
In fact:
1. Consider this definition of iteration: a computation feeds its own output back in as its next input; typically, a map function takes its own output as the input of the next map. Under that definition the map function must return and be re-executed before the process finishes;
2. /**********/ Is the definition above actually semantically right in practice? Couldn't a recursive call inside the map function also implement iteration? And how does all of this relate to the in-memory abstraction?
3. The problem is probably not the point raised in 1 and 2. What really needs an in-memory abstraction is data sharing between different components; iteration is only one example. MapReduce's real shortcoming is that it gives the programmer no such in-memory abstraction for sharing map results;
4. The real problem, close to what 1 and 2 describe, is that an algorithm may need to reuse the result of a map repeatedly, or to map over the result of a reduce, so the result needs to be kept in memory. For example:
val points = spark.textFile(...).map(parsePoint).persist() // the map result is reused, so persist it
var w = // random initial vector
for (i <- 1 to ITERATIONS) {
  val gradient = points.map { p =>
    p.x * (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y
  }.reduce((a, b) => a + b)
  w -= gradient // the gradient only exists after the reduce
}
10. The split is a logical split of the inputs; the input files are not physically split into chunks. For example, a split could be an <input-file-path, start, offset> tuple. The InputFormat also creates the RecordReader used to read the InputSplit. A sketch of inspecting these logical splits follows.
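A small sketch (assuming the new mapreduce API and that the input path is passed as the first command-line argument) that simply prints the logical splits TextInputFormat would hand to the map tasks:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}
import scala.collection.JavaConverters._

object SplitInspect {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance()
    FileInputFormat.addInputPath(job, new Path(args(0)))

    // Each InputSplit is logical: for text files it is roughly (file, start offset, length),
    // and the record reader later snaps the ranges to line boundaries.
    new TextInputFormat().getSplits(job).asScala.foreach { split =>
      println(s"$split length=${split.getLength}")
    }
  }
}
```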
>20161103: Ambari platform administration study
1. After an agent has lost heartbeats, the Ambari agent must be restarted (ambari-agent restart) to recover;
>20161103: Spark-HBase programming study
1.
Master URL | Meaning
---|---
local | Run Spark locally with one worker thread (i.e. no parallelism at all).
local[K] | Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
local[*] | Run Spark locally with as many worker threads as logical cores on your machine.
spark://HOST:PORT | Connect to the given Spark standalone cluster master. The port must be whichever one your master is configured to use, which is 7077 by default.
mesos://HOST:PORT | Connect to the given Mesos cluster. The port must be whichever one your cluster is configured to use, which is 5050 by default. Or, for a Mesos cluster using ZooKeeper, use mesos://zk://... . To submit with --deploy-mode cluster, the HOST:PORT should be configured to connect to the MesosClusterDispatcher.
yarn | Connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.
2. In practice, when running on a cluster, you will not want to hardcode master in the program, but rather launch the application with spark-submit and receive it there. However, for local testing and unit tests, you can pass "local" to run Spark in-process.
In other words, the master can be hardcoded in a Spark program, but for easier configuration the application is usually launched from the shell, so the master is passed in as a parameter and nothing needs to be set in the code (see the sketch below);
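A sketch of that convention (the class and jar names are illustrative): the code sets no master, and spark-submit supplies it:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object NoHardcodedMaster {
  def main(args: Array[String]): Unit = {
    // No setMaster() here: the master URL comes from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("no-hardcoded-master"))
    println(sc.textFile("hdfs:///data/input").count())
    sc.stop()
  }
}

// Launched, for example, with:
//   spark-submit --master yarn --deploy-mode client \
//     --class NoHardcodedMaster my-app.jar
```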
3. Conceptual dependency chain for understanding: Spark --> RDD --> MapReduce --> Hadoop InputFormat;
4. When data is delivered to map, the input split is handed to the InputFormat; the InputFormat calls getRecordReader() to produce a RecordReader, and the RecordReader in turn uses createKey() and createValue() to build the <key, value> pairs that map can process. In short, the InputFormat is what generates the <key, value> pairs consumed by map.
5. A Result is backed by an array of Cell objects, each representing an HBase cell defined by the row, family, qualifier, timestamp, and value. Note that the splits produced by the stock TableInputFormat also return values of type Result, so calling newAPIHadoopRDD through Spark's Scala API yields an RDD[(ImmutableBytesWritable, Result)] (see the sketch below).
In particular, a Result represents a single row result of a Get or Scan query.
To get a complete mapping of all cells in the Result, which can include multiple families and multiple versions, use getMap().
To get a mapping of each family to its columns (qualifiers and values), including only the latest version of each, use getNoVersionMap().
To get a mapping of qualifiers to latest values for an individual family use getFamilyMap(byte[]).
To get the latest value for a specific family and qualifier use getValue(byte[], byte[]).
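A minimal sketch of reading HBase from Spark via newAPIHadoopRDD, assuming the table 'myy' with column family 'datatime' created by the sqoop import above; the qualifier name 'lastvalue' is my assumption and purely illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

object HBaseReadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-read"))
    val hconf = HBaseConfiguration.create()
    hconf.set(TableInputFormat.INPUT_TABLE, "myy") // table created by the sqoop import above

    // Each element is (row key, Result): TableInputFormat's splits yield Result values.
    val rdd = sc.newAPIHadoopRDD(hconf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])

    rdd.map { case (rowKey, result) =>
      // getValue returns the latest cell for the given family/qualifier, or null if absent.
      val v = result.getValue(Bytes.toBytes("datatime"), Bytes.toBytes("lastvalue"))
      (Bytes.toString(rowKey.get()), Option(v).map(Bytes.toString).orNull)
    }.take(10).foreach(println)

    sc.stop()
  }
}
```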
6. HBase guarantees atomicity (i.e. thread safety) only at the row level: while a writer is operating on a row, readers of that row wait until the row lock is released; for concurrent operations spanning multiple rows no atomicity is guaranteed. One API built on this row-level atomicity is sketched below.
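A sketch of checkAndPut, which relies on exactly this row-level atomicity (it compares and writes within one row as a single atomic operation; across rows there is no such guarantee). The table, row key, and column names are illustrative, and the HBase 1.x client API is assumed:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

object RowAtomicityDemo {
  def main(args: Array[String]): Unit = {
    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("myy")) // names are illustrative
    try {
      val row = Bytes.toBytes("2016-10-11 00:00:00")
      val put = new Put(row)
        .addColumn(Bytes.toBytes("datatime"), Bytes.toBytes("lastvalue"), Bytes.toBytes("42.0"))
      // Write only if the current value of datatime:lastvalue is still "41.0"; atomic for this row only.
      val applied = table.checkAndPut(row, Bytes.toBytes("datatime"), Bytes.toBytes("lastvalue"),
        Bytes.toBytes("41.0"), put)
      println(s"checkAndPut applied: $applied")
    } finally {
      table.close(); conn.close()
    }
  }
}
```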