OD Learns Hadoop, Week 2 (0703)

HDFS web UI: http://beifeng-hadoop-01:50070/dfshealth.html#tab-overview

YARN web UI: http://beifeng-hadoop-01:8088/cluster

JobHistory Server web UI: http://beifeng-hadoop-01:19888/

sbin/hadoop-daemon.sh start namenode

sbin/hadoop-daemon.sh start datanode

sbin/yarn-daemon.sh start resourcemanager

sbin/yarn-daemon.sh start nodemanager

sbin/mr-jobhistory-daemon.sh start historyserver

sbin/hadoop-daemon.sh stop namenode

sbin/hadoop-daemon.sh stop datanode

sbin/yarn-daemon.sh stop resourcemanager

sbin/yarn-daemon.sh stop nodemanager

sbin/mr-jobhistory-daemon.sh stop historyserver

I. Replacing the native libraries

mv native/ bak_native

tar -zxf native-**.gz -C /opt/modules/hadoop-2.5.0/lib

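II. SecondaryNameNode

1. The namenode stores the metadata of the entire file system.

2. Formatting the namenode creates a directory.

3. Formatting also creates the initial file system metadata.

bin/hdfs namenode -format

4. While the namenode is running, the metadata is held in memory.

5. Before the namenode starts, the metadata exists only in files on the local file system.

6. Formatting generates an fsimage file.

More precisely, this is the file system image file, which stores the metadata.

7. Any operation on HDFS, such as uploading or creating a file, changes the metadata.

8. These operations are recorded in an operation log:

edits logs (the edit log files)

9. With the log files in place, when the namenode starts again it first reads fsimage,

then replays the edit log file (edits), so no changes are lost.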

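10. This raises a question: should a service process merge fsimage and edits at regular intervals?

11. The SecondaryNameNode reads fsimage and edits into memory.

It writes the in-memory result to a new fsimage file, after which the two original files are no longer needed, and a new edits file is then created to keep recording changes.

Note: loading fsimage is fast; replaying edits is slow.

12. What the SecondaryNameNode does:

(1) Merges fsimage and edits

(2) Shortens the namenode's next startup time

13. Configuration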

hdfs-site.xml

<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>beifeng-hadoop-01:50090</value>
</property>

14. Start command:

$ sbin/hadoop-daemon.sh start secondarynamenode

http://beifeng-hadoop-01:50090/status.html

fsimage checkpoint directory: file:///opt/modules/hadoop-2.5.0/data/tmp/dfs/namesecondary

edits checkpoint directory: file:///opt/modules/hadoop-2.5.0/data/tmp/dfs/namesecondary
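
The merge described in items 9 to 11 can be sketched in a few lines of plain Java. This is only a toy model: the real fsimage and edits files are binary, and the text format used here ("OP path") as well as the class and method names are invented for illustration. It shows the idea of loading the old image, replaying the logged operations in order, and ending up with the state that would be written out as the new fsimage.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Toy model of a checkpoint: replay the edit log on top of the old fsimage. */
public class CheckpointSketch {

    // fsimage: paths known at the last checkpoint.
    // edits: operations logged since then, in a made-up "OP path" format.
    static TreeSet<String> checkpoint(List<String> fsimage, List<String> edits) {
        TreeSet<String> namespace = new TreeSet<>(fsimage); // load fsimage into memory
        for (String edit : edits) {                          // replay edits in order
            String[] parts = edit.split(" ", 2);
            String op = parts[0];
            String path = parts[1];
            switch (op) {
                case "MKDIR":
                case "CREATE":
                    namespace.add(path);
                    break;
                case "DELETE":
                    namespace.remove(path);
                    break;
                default:
                    throw new IllegalArgumentException("unknown op: " + op);
            }
        }
        return namespace; // this state would be saved as the new fsimage
    }

    public static void main(String[] args) {
        List<String> fsimage = Arrays.asList("/", "/user");
        List<String> edits = Arrays.asList(
                "MKDIR /user/beifeng",
                "CREATE /user/beifeng/a.txt",
                "CREATE /user/beifeng/b.txt",
                "DELETE /user/beifeng/a.txt");
        System.out.println(checkpoint(fsimage, edits));
    }
}
```

The longer the edit log, the more operations must be replayed, which is why merging it into fsimage ahead of time shortens the next namenode startup.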

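III. Configuring the HDFS storage directories

IV. Configuration files, client, and server

1. Hadoop has two kinds of configuration files:

default files and custom files.

Cluster performance can be tuned by changing configuration values.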

Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

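Each module has its own configuration file.

2. Files loaded at startup

(1) First, the default configuration files are loaded.

(2) Second, the custom configuration files are loaded: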

core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml

3. Custom configuration files take precedence over the default ones.

4. HDFS reads four configuration files:

core-default.xml
hdfs-default.xml
core-site.xml
hdfs-site.xml

5. Server side: the namenode and datanode both read the configuration files when they start.

6. Client
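
The load order and precedence from items 2 and 3 can be illustrated with the JDK's own `java.util.Properties`, which supports a defaults layer. This is an analogy, not Hadoop's `Configuration` class: the two `Properties` objects merely stand in for `hdfs-default.xml` and `hdfs-site.xml`, the class name is made up, and the site value `1` is an assumed example for a single-node setup.

```java
import java.util.Properties;

/** Analogy for "defaults first, custom values on top" using java.util.Properties. */
public class ConfigPrecedenceSketch {

    static Properties load() {
        // Step 1: default values (stands in for hdfs-default.xml).
        Properties defaults = new Properties();
        defaults.setProperty("dfs.replication", "3");
        defaults.setProperty("dfs.blocksize", "134217728");

        // Step 2: custom values layered on top (stands in for hdfs-site.xml).
        Properties site = new Properties(defaults);
        site.setProperty("dfs.replication", "1"); // assumed custom value
        return site;
    }

    public static void main(String[] args) {
        Properties conf = load();
        // Overridden in the custom layer, so the custom value wins.
        System.out.println("dfs.replication = " + conf.getProperty("dfs.replication"));
        // Not overridden, so the lookup falls back to the default.
        System.out.println("dfs.blocksize   = " + conf.getProperty("dfs.blocksize"));
    }
}
```

A key that appears only in the defaults layer is still visible, which mirrors why a site file only needs to list the properties it changes.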

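VII. Passwordless SSH login

Commands are run through scripts, and the scripts connect to remote hosts over the SSH protocol.

Generate the key pair: ssh-keygen -t rsa

id_rsa (private key)

id_rsa.pub (public key)

ssh-copy-id beifeng-hadoop-01

known_hosts (host keys of the servers this account has connected to)

authorized_keys (public keys that are allowed to log in to this account)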

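VIII. The ecosystem built around Hadoop 2.x

Compute framework: MapReduce

Resource container for compute frameworks: YARN

Data storage: HDFS

Operating system: CentOS

Data sources: relational databases, log files ======> HDFS

Sqoop: table data in relational databases <==> HDFS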

http://blog.csdn.net/yfkiss/article/details/8700480

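Flume: monitors log files and extracts their data in real time ==> HDFS

ZooKeeper: distributed coordination framework

Hive: HiveQL statements, compiled into MapReduce jobs

Pig: a data-flow programming language

Real-time queries: fast lookups in a single table with hundreds of millions of rows, Bigtable -> HBase (a distributed database)

Oozie: a framework that lets us combine multiple Map/Reduce jobs into one logical unit of work

CM (Cloudera Manager): integrates the components listed above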

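IX. HDFS architecture

1. namenode

2. datanode

File blocks are replicated to keep the data safe.

HDFS is suited to large data sets, in the GB or TB range.

Scenarios HDFS is not suited to:

Large numbers of small files;

Multiple concurrent writers, or arbitrary modifications to files.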
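
One way to see why many small files are a poor fit is to count blocks, since the namenode keeps metadata for every file and block in memory. The sketch below assumes a block size of 128 MB (the Hadoop 2.x default for `dfs.blocksize`; a given cluster may be configured differently) and compares storing 1 GB as a single file with storing the same amount of data as 1 KB files. The class and method names are made up for illustration.

```java
/** Counts HDFS blocks for a file, assuming a 128 MB block size. */
public class BlockCountSketch {

    static final long BLOCK_SIZE = 128L * 1024 * 1024; // assumed dfs.blocksize

    // Blocks needed for one file; the last block may be only partly filled.
    static long blocksFor(long fileBytes) {
        if (fileBytes == 0) {
            return 0; // an empty file has no blocks
        }
        return (fileBytes + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;
        long oneKb = 1024L;

        long bigFileBlocks = blocksFor(oneGb);                 // one 1 GB file
        long smallFileCount = oneGb / oneKb;                   // number of 1 KB files
        long smallFileBlocks = smallFileCount * blocksFor(oneKb);

        System.out.println("1 GB as one file:   " + bigFileBlocks + " blocks");
        System.out.println("1 GB as 1 KB files: " + smallFileBlocks + " blocks");
    }
}
```

The data volume is identical in both cases, but the small-file layout gives the namenode vastly more files and blocks to track.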

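The metadata for HDFS files and directories is stored in the binary fsimage file.

edits

What loading fsimage accomplishes:

(1) Read every directory and file saved in HDFS from fsimage;

(2) Initialize the metadata of each directory and file;

(3) Build the in-memory image of the whole namespace from the directory and file paths;

(4) If the entry is a file,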

X. HDFS shell commands

[beifeng@beifeng-hadoop-01 hadoop-2.5.0]$ bin/hdfs dfs

Usage: hadoop fs [generic options]

        [-appendToFile <localsrc> ... <dst>]

        [-cat [-ignoreCrc] <src> ...]

        [-checksum <src> ...]

        [-chgrp [-R] GROUP PATH...]

        [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]

        [-chown [-R] [OWNER][:[GROUP]] PATH...]

        [-copyFromLocal [-f] [-p] <localsrc> ... <dst>]

        [-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]

        [-count [-q] <path> ...]

        [-cp [-f] [-p | -p[topax]] <src> ... <dst>]

        [-createSnapshot <snapshotDir> [<snapshotName>]]

        [-deleteSnapshot <snapshotDir> <snapshotName>]

        [-df [-h] [<path> ...]]

        [-du [-s] [-h] <path> ...]

        [-expunge]

        [-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]

        [-getfacl [-R] <path>]

        [-getfattr [-R] {-n name | -d} [-e en] <path>]

        [-getmerge [-nl] <src> <localdst>]

        [-help [cmd ...]]

        [-ls [-d] [-h] [-R] [<path> ...]]

        [-mkdir [-p] <path> ...]

        [-moveFromLocal <localsrc> ... <dst>]

        [-moveToLocal <src> <localdst>]

        [-mv <src> ... <dst>]

        [-put [-f] [-p] <localsrc> ... <dst>]

        [-renameSnapshot <snapshotDir> <oldName> <newName>]

        [-rm [-f] [-r|-R] [-skipTrash] <src> ...]

        [-rmdir [--ignore-fail-on-non-empty] <dir> ...]

        [-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]

        [-setfattr {-n name [-v value] | -x name} <path>]

        [-setrep [-R] [-w] <rep> <path> ...]

        [-stat [format] <path> ...]

        [-tail [-f] <file>]

        [-test -[defsz] <path>]

        [-text [-ignoreCrc] <src> ...]

        [-touchz <path> ...]

        [-usage [cmd ...]]

Generic options supported are

-conf <configuration file>     specify an application configuration file

-D <property=value>            use value for given property

-fs <local|namenode:port>      specify a namenode

-jt <local|jobtracker:port>    specify a job tracker

-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster

-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.

-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is

bin/hadoop command [genericOptions] [commandOptions]

[beifeng@beifeng-hadoop-01 hadoop-2.5.0]$ bin/hdfs

Usage: hdfs [--config confdir] COMMAND

       where COMMAND is one of:

  dfs                  run a filesystem command on the file systems supported in Hadoop.

  namenode -format     format the DFS filesystem

  secondarynamenode    run the DFS secondary namenode

  namenode             run the DFS namenode

  journalnode          run the DFS journalnode

  zkfc                 run the ZK Failover Controller daemon

  datanode             run a DFS datanode

  dfsadmin             run a DFS admin client

  haadmin              run a DFS HA admin client

  fsck                 run a DFS filesystem checking utility

  balancer             run a cluster balancing utility

  jmxget               get JMX exported values from NameNode or DataNode.

  oiv                  apply the offline fsimage viewer to an fsimage

  oiv_legacy           apply the offline fsimage viewer to an legacy fsimage

  oev                  apply the offline edits viewer to an edits file

  fetchdt              fetch a delegation token from the NameNode

  getconf              get config values from configuration

  groups               get the groups which users belong to

  snapshotDiff         diff two snapshots of a directory or diff the

                       current directory contents with a snapshot

  lsSnapshottableDir   list all snapshottable dirs owned by the current user

                                                Use -help to see options

  portmap              run a portmap service

  nfs3                 run an NFS version 3 gateway

  cacheadmin           configure the HDFS cache

Most commands print help when invoked w/o parameters.

[beifeng@beifeng-hadoop-01 hadoop-2.5.0]$ bin/hdfs dfsadmin

Usage: java DFSAdmin

Note: Administrative commands can only be run as the HDFS superuser.

           [-report]

           [-safemode enter | leave | get | wait]

           [-allowSnapshot <snapshotDir>]

           [-disallowSnapshot <snapshotDir>]

           [-saveNamespace]

           [-rollEdits]

           [-restoreFailedStorage true|false|check]

           [-refreshNodes]

           [-finalizeUpgrade]

           [-rollingUpgrade [<query|prepare|finalize>]]

           [-metasave filename]

           [-refreshServiceAcl]

           [-refreshUserToGroupsMappings]

           [-refreshSuperUserGroupsConfiguration]

           [-refreshCallQueue]

           [-refresh]

           [-printTopology]

           [-refreshNamenodes datanodehost:port]

           [-deleteBlockPool datanode-host:port blockpoolId [force]]

           [-setQuota <quota> <dirname>...<dirname>]

           [-clrQuota <dirname>...<dirname>]

           [-setSpaceQuota <quota> <dirname>...<dirname>]

           [-clrSpaceQuota <dirname>...<dirname>]

           [-setBalancerBandwidth <bandwidth in bytes per second>]

           [-fetchImage <local directory>]

           [-shutdownDatanode <datanode_host:ipc_port> [upgrade]]

           [-getDatanodeInfo <datanode_host:ipc_port>]

           [-help [cmd]]

Generic options supported are

-conf <configuration file>     specify an application configuration file

-D <property=value>            use value for given property

-fs <local|namenode:port>      specify a namenode

-jt <local|jobtracker:port>    specify a job tracker

-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster

-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.

-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is

bin/hadoop command [genericOptions] [commandOptions]

XI. Safe mode

[beifeng@beifeng-hadoop-01 hadoop-2.5.0]$ bin/hdfs dfsadmin -safemode

Usage: java DFSAdmin [-safemode enter | leave | get | wait]

In safe mode the namespace is read-only: files cannot be created, modified, or deleted.

bin/hdfs dfsadmin -safemode get

bin/hdfs dfsadmin -safemode enter

bin/hdfs dfsadmin -safemode leave
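
The effect of the three commands above can be pictured with a toy model in plain Java. It is not Hadoop code: the class, its method names, and the error message are invented for illustration. The only behavior it models is that reads keep working in safe mode while changes to the namespace are rejected until safe mode is left.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy namenode that rejects namespace changes while in safe mode. */
public class SafeModeSketch {
    private boolean safeMode = true;                 // a namenode starts in safe mode
    private final Set<String> files = new HashSet<>();

    void enter() { safeMode = true; }                // like: dfsadmin -safemode enter
    void leave() { safeMode = false; }               // like: dfsadmin -safemode leave
    boolean get() { return safeMode; }               // like: dfsadmin -safemode get

    boolean exists(String path) {                    // reads are still allowed
        return files.contains(path);
    }

    void create(String path) {                       // writes are rejected in safe mode
        if (safeMode) {
            throw new IllegalStateException("safe mode is ON, cannot create " + path);
        }
        files.add(path);
    }

    public static void main(String[] args) {
        SafeModeSketch nn = new SafeModeSketch();
        try {
            nn.create("/user/beifeng/a.txt");
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        nn.leave();
        nn.create("/user/beifeng/a.txt");
        System.out.println("exists: " + nn.exists("/user/beifeng/a.txt"));
    }
}
```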

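XII. Installing Eclipse and Maven

XIII. YARN

Resource management system: resource allocation and resource isolation

XIV. HDFS API

Common Maven repository addresses

http://hadoop.apache.org/docs/r2.5.2/api/index.html

1. Getting the HDFS file system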

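import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Loads core-default.xml and core-site.xml from the classpath
Configuration conf = new Configuration();
// Returns the file system named by fs.defaultFS; may throw IOException
FileSystem fileSystem = FileSystem.get(conf);
System.out.println(fileSystem);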

XV. The YARN-centered ecosystem

1. Hortonworks: BATCH (MapReduce)

2. Services running on YARN: long-running services and short-lived services

3. Apache Slider

4. Solr
