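1. Create the nested directory "1daoyun/file" under the root of the HDFS file system, upload the attached BigDataSkills.txt file to the 1daoyun/file directory, and use the appropriate command to list the files in 1daoyun/file.

Answer: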

[root@master MapReduce]# hadoop fs -mkdir -p /1daoyun/file

[root@master MapReduce]# hadoop fs -put BigDataSkills.txt /1daoyun/file

[root@master MapReduce]# hadoop fs -ls /1daoyun/file

Found 1 items

-rw-r--r--   3 root hdfs       1175 2018-02-12 08:01 /1daoyun/file/BigDataSkills.txt

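2. Create the nested directory "1daoyun/file" under the root of the HDFS file system and upload the attached BigDataSkills.txt file to the 1daoyun/file directory, specifying a replication factor of 2 for BigDataSkills.txt during the upload. Then use the fsck tool to check the number of replicas of the stored blocks.

Answer: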

[root@master MapReduce]# hadoop fs -mkdir -p /1daoyun/file

[root@master MapReduce]# hadoop fs -D dfs.replication=2 -put BigDataSkills.txt /1daoyun/file

[root@master MapReduce]# hadoop fsck /1daoyun/file/BigDataSkills.txt

DEPRECATED: Use of this script to execute hdfs command is deprecated.

Instead use the hdfs command for it.

Connecting to namenode via http://master.hadoop:50070/fsck?ugi=root&path=%2F1daoyun%2Ffile%2FBigDataSkills.txt

FSCK started by root (auth:SIMPLE) from /10.0.6.123 for path /1daoyun/file/BigDataSkills.txt at Mon Feb 12 08:11:47 UTC 2018

.

/1daoyun/file/BigDataSkills.txt:  Under replicated BP-297530755-10.0.6.123-1518056860260:blk_1073746590_5766. Target Replicas is 2 but found 1 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).

Status: HEALTHY

Total size: 1175 B

Total dirs: 0

Total files: 1

Total symlinks: 0

Total blocks (validated): 1 (avg. block size 1175 B)

Minimally replicated blocks: 1 (100.0 %)

Over-replicated blocks: 0 (0.0 %)

Under-replicated blocks: 1 (100.0 %)

Mis-replicated blocks: 0 (0.0 %)

Default replication factor: 3

Average block replication: 1.0

Corrupt blocks: 0

Missing replicas: 1 (50.0 %)

Number of data-nodes: 1

Number of racks: 1

FSCK ended at Mon Feb 12 08:11:47 UTC 2018 in 1 milliseconds

The filesystem under path '/1daoyun/file/BigDataSkills.txt' is HEALTHY
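The replica figures in the report above can be reproduced by hand. The sketch below is an illustration only, not the fsck implementation: it assumes the percentage is the number of missing replicas divided by the number of expected replicas, which is consistent with the 50.0 % shown. The function name `replica_summary` is hypothetical.

```python
def replica_summary(target_replication, live_replicas_per_block):
    """Return (expected, missing, missing_pct) for the blocks of one file.

    live_replicas_per_block holds the live replica count of each block.
    """
    expected = target_replication * len(live_replicas_per_block)
    # A block that has enough replicas does not offset one that lacks them.
    missing = sum(max(target_replication - live, 0)
                  for live in live_replicas_per_block)
    missing_pct = 100.0 * missing / expected if expected else 0.0
    return expected, missing, missing_pct

# One block, target of 2 replicas, 1 live replica (values from the report above).
print(replica_summary(2, [1]))
```

With a single DataNode a block can have only one live replica, so the target of 2 cannot be met until another DataNode is available. The path is still reported as HEALTHY because every block has at least one replica.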

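3. A directory /apps exists under the root of the HDFS file system. Enable snapshots on this directory, create a snapshot of it named apps_1daoyun, and use the appropriate command to list the snapshot.

Answer: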

[hdfs@master ~]# hadoop dfsadmin -allowSnapshot /apps

Allowing snaphot on /apps succeeded

[hdfs@master ~]# hadoop fs -createSnapshot /apps apps_1daoyun

Created snapshot /apps/.snapshot/apps_1daoyun

[hdfs@master ~]# hadoop fs -ls /apps/.snapshot

Found 1 items

drwxrwxrwx   - hdfs hdfs          0 2017-05-07 09:48 /apps/.snapshot/apps_1daoyun

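4. To protect against accidental deletion, HDFS provides a trash (recycle bin) feature, but too many trashed files take up a large amount of storage. Use the "vi" command in the Linux shell to edit the relevant configuration file and parameter so that the trash feature is disabled, then restart the affected services.

Answer: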

[root@master ~]# vi /etc/hadoop/2.6.1.0-129/0/hdfs-site.xml

<property>

<name>fs.trash.interval</name>

<value>0</value>

</property>

[root@master ~]# su - hdfs

Last login: Mon May  8 09:31:52 UTC 2017

[hdfs@master ~]$ /usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh --config /usr/hdp/current/hadoop-client/conf stop namenode

[hdfs@master ~]$ /usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh --config /usr/hdp/current/hadoop-client/conf start namenode

[hdfs@master ~]$ /usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh --config /usr/hdp/current/hadoop-client/conf stop datanode

[hdfs@master ~]$ /usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh --config /usr/hdp/current/hadoop-client/conf start datanode

5. Use a command to show the number of directories, the number of files, and the total file size under /tmp in the HDFS file system.

Answer:

[root@master ~]# hadoop fs -count  /tmp

21            6               4336 /tmp
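The columns printed by `hadoop fs -count` are, in order, the directory count, the file count, the content size in bytes, and the path. As a small sketch, the line above can be parsed into named fields; the names `CountRow` and `parse_count_line` are hypothetical helpers introduced here for illustration.

```python
from typing import NamedTuple

class CountRow(NamedTuple):
    dir_count: int
    file_count: int
    content_size: int  # bytes
    path: str

def parse_count_line(line: str) -> CountRow:
    # Split on runs of whitespace; the path is everything after the third field.
    dirs, files, size, path = line.split(None, 3)
    return CountRow(int(dirs), int(files), int(size), path.strip())

print(parse_count_line("21            6               4336 /tmp"))
```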

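6. On the cluster nodes, the directory /usr/hdp/2.6.1.0-129/hadoop-mapreduce/ contains the example JAR hadoop-mapreduce-examples.jar. Run the wordcount program from this JAR to count the words in /1daoyun/file/BigDataSkills.txt, write the result to /1daoyun/output, and use the appropriate command to view the word counts.

Answer: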

[root@master ~]# hadoop jar /usr/hdp/2.6.1.0-129/hadoop-mapreduce/hadoop-mapreduce-examples-2.7.3.2.6.1.0-129.jar wordcount /1daoyun/file/BigDataSkills.txt /1daoyun/output

[root@master ~]# hadoop fs -cat /1daoyun/output/part-r-00000

"duiya  1

hello   1

nisibusisha     1

wosha"  1

zsh     1
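The keys `"duiya` and `wosha"` keep their quotation marks because the example job splits lines on whitespace only and does not strip punctuation. A minimal local sketch of the same counting logic follows; the sample lines are hypothetical, since the contents of the input file are not shown here.

```python
from collections import Counter

def word_count(lines):
    """Count whitespace-separated tokens and return them sorted by token."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())  # punctuation stays attached to tokens
    return sorted(counts.items())

# Hypothetical input, chosen to resemble the output above.
sample = ['hello "duiya nisibusisha', 'wosha" zsh']
for word, n in word_count(sample):
    print(f"{word}\t{n}")
```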

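7. On the cluster nodes, the directory /usr/hdp/2.6.1.0-129/hadoop-mapreduce/ contains the example JAR hadoop-mapreduce-examples.jar. Run the sudoku program from this JAR to solve the Sudoku puzzle in the table below (empty cells are marked _).

8 _ _ _ _ _ _ _ _

_ _ 3 6 _ _ _ _ _

_ 7 _ _ 9 _ 2 _ _

_ 5 _ _ _ 7 _ _ _

_ _ _ _ 4 5 7 _ _

_ _ _ 1 _ _ _ 3 _

_ _ 1 _ _ _ _ 6 8

_ _ 8 5 _ _ _ 1 _

_ 9 _ _ _ _ 4 _ _

Answer: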

[root@master ~]# cat puzzle1.dta

8 ? ? ? ? ? ? ? ?

? ? 3 6 ? ? ? ? ?

? 7 ? ? 9 ? 2 ? ?

? 5 ? ? ? 7 ? ? ?

? ? ? ? 4 5 7 ? ?

? ? ? 1 ? ? ? 3 ?

? ? 1 ? ? ? ? 6 8

? ? 8 5 ? ? ? 1 ?

? 9 ? ? ? ? 4 ? ?

[root@master hadoop-mapreduce]# hadoop jar hadoop-mapreduce-examples-2.7.1.2.4.3.0-227.jar sudoku /root/puzzle1.dta

WARNING: Use "yarn jar" to launch YARN applications.

Solving /root/puzzle1.dta

8 1 2 7 5 3 6 4 9

9 4 3 6 8 2 1 7 5

6 7 5 4 9 1 2 8 3

1 5 4 2 3 7 8 9 6

3 6 9 8 4 5 7 2 1

2 8 7 1 6 9 5 3 4

5 2 1 9 7 4 3 6 8

4 3 8 5 2 6 9 1 7

7 9 6 3 1 8 4 5 2

Found 1 solutions
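The printed grid can be checked independently of the solver. The sketch below is a plain validator, not the algorithm used by the example program: it confirms that every row, column, and 3x3 box contains the digits 1 to 9 and that the solution agrees with the clues in puzzle1.dta.

```python
PUZZLE = """\
8 ? ? ? ? ? ? ? ?
? ? 3 6 ? ? ? ? ?
? 7 ? ? 9 ? 2 ? ?
? 5 ? ? ? 7 ? ? ?
? ? ? ? 4 5 7 ? ?
? ? ? 1 ? ? ? 3 ?
? ? 1 ? ? ? ? 6 8
? ? 8 5 ? ? ? 1 ?
? 9 ? ? ? ? 4 ? ?
"""

SOLUTION = """\
8 1 2 7 5 3 6 4 9
9 4 3 6 8 2 1 7 5
6 7 5 4 9 1 2 8 3
1 5 4 2 3 7 8 9 6
3 6 9 8 4 5 7 2 1
2 8 7 1 6 9 5 3 4
5 2 1 9 7 4 3 6 8
4 3 8 5 2 6 9 1 7
7 9 6 3 1 8 4 5 2
"""

def parse(grid):
    return [row.split() for row in grid.strip().splitlines()]

def is_valid_solution(puzzle, solution):
    p, s = parse(puzzle), parse(solution)
    digits = set("123456789")
    rows_ok = all(len(r) == 9 and set(r) == digits for r in s)
    cols_ok = all({s[r][c] for r in range(9)} == digits for c in range(9))
    boxes_ok = all(
        {s[br + r][bc + c] for r in range(3) for c in range(3)} == digits
        for br in (0, 3, 6) for bc in (0, 3, 6)
    )
    # Every given clue must be unchanged in the solution.
    clues_ok = all(p[r][c] in ("?", s[r][c])
                   for r in range(9) for c in range(9))
    return rows_ok and cols_ok and boxes_ok and clues_ok

print(is_valid_solution(PUZZLE, SOLUTION))
```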

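8. On the cluster nodes, the directory /usr/hdp/2.6.1.0-129/hadoop-mapreduce/ contains the example JAR hadoop-mapreduce-examples.jar. Run the grep program from this JAR to count how many times "Hadoop" occurs in the file /1daoyun/file/BigDataSkills.txt, then view the result.

Answer: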

[root@master hadoop-mapreduce]# hadoop jar hadoop-mapreduce-examples-2.7.1.2.4.3.0-227.jar grep /1daoyun/file/BigDataSkills.txt /output hadoop

[root@master hadoop-mapreduce]# hadoop fs -cat /output/part-r-00000

2       hadoop
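The last argument of the grep example is a regular expression, and matching is case-sensitive unless the pattern says otherwise. The transcript searches for `hadoop`, while the task wording uses "Hadoop", so the two may give different counts. The sketch below shows the difference on a hypothetical sample string; it does not use the real file contents.

```python
import re

def count_matches(text, pattern):
    """Count non-overlapping matches of a regular expression in text."""
    return len(re.findall(pattern, text))

sample = "Hadoop and hadoop; hadoop fs"  # hypothetical text
print(count_matches(sample, "hadoop"))     # lowercase only
print(count_matches(sample, "Hadoop"))     # capitalised only
print(count_matches(sample, "[Hh]adoop"))  # either spelling
```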

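9. Start the HBase database of the Xiandian big data platform, using the RegionServer on the master node. Start the HBase shell from the Linux shell and show the current system user inside the HBase shell. (Write all database commands in lowercase.)

Answer: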

hbase(main):003:0> whoami

root (auth:SIMPLE)

groups: root

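10. Enable HBase security authorization, then in the HBase shell grant the root user read, write, and execute permissions on the table xiandian_user. Afterwards, use the appropriate command to view the permissions.

Answer:

Parameter: Enable Authorization

Value: native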

hbase(main):002:0> grant 'root','RWX','xiandian_user'

0 row(s) in 0.4800 seconds

hbase(main):003:0> user_permission 'xiandian_user'

User                                             Namespace,Table,Family,Qualifier:Permission

root                                            default,xiandian_user,,: [Permission: actions=READ,WRITE,EXEC]

1 row(s) in 0.1180 seconds

11. Log in to the HBase database and create a table named member with the column families 'address' and 'info'. Then insert the following data into the table:

'xiandianA','info:age','24'

'xiandianA','info:birthday','1990-07-17'

'xiandianA','info:company','alibaba'

'xiandianA','address:contry','china'

'xiandianA','address:province','zhejiang'

'xiandianA','address:city','hangzhou'

After the inserts, use a command to query all info data for xiandianA in the member table. Finally, change xiandianA's age to 99 and query only info:age.

Answer:

hbase(main):001:0> create 'member','address','info'

0 row(s) in 1.5730 seconds

=> Hbase::Table - member

hbase(main):002:0> list

TABLE

emp

member

2 row(s) in 0.0240 seconds

hbase(main):007:0> put 'member','xiandianA','info:age','24'

0 row(s) in 0.1000 seconds

hbase(main):008:0> put 'member','xiandianA','info:birthday','1990-07-17'

0 row(s) in 0.0130 seconds

hbase(main):010:0> put 'member','xiandianA','info:company','alibaba'

0 row(s) in 0.0080 seconds

hbase(main):011:0> put 'member','xiandianA','address:contry','china'

0 row(s) in 0.0080 seconds

hbase(main):012:0> put 'member','xiandianA','address:province','zhejiang'

0 row(s) in 0.0070 seconds

hbase(main):013:0> put 'member','xiandianA','address:city','hangzhou'

0 row(s) in 0.0090 seconds

hbase(main):014:0> get 'member','xiandianA','info'

COLUMN                  CELL

info:age               timestamp=1522140592336, value=24

info:birthday          timestamp=1522140643072, value=1990-07-17

info:company           timestamp=1522140745172, value=alibaba

3 row(s) in 0.0170 seconds

hbase(main):015:0>

hbase(main):016:0* put 'member','xiandianA','info:age','99'

0 row(s) in 0.0080 seconds

hbase(main):018:0> get 'member','xiandianA','info:age'

COLUMN                  CELL

info:age               timestamp=1522141564423, value=99

1 row(s) in 0.0140 seconds
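Changing the age uses the same put command as the original insert: a put to an existing cell writes a new version with a newer timestamp, and a plain get returns the newest version. The sketch below models only that behaviour with an in-memory dictionary; `TinyTable` is a hypothetical class and not part of any HBase client API.

```python
from collections import defaultdict

class TinyTable:
    """Toy model of versioned cells: (row, column) -> {timestamp: value}."""

    def __init__(self):
        self._cells = defaultdict(dict)

    def put(self, row, column, value, timestamp):
        self._cells[(row, column)][timestamp] = value

    def get(self, row, column):
        versions = self._cells.get((row, column))
        if not versions:
            return None
        newest = max(versions)  # the latest timestamp wins
        return newest, versions[newest]

member = TinyTable()
# Timestamps and values taken from the transcript above.
member.put("xiandianA", "info:age", "24", 1522140592336)
member.put("xiandianA", "info:age", "99", 1522141564423)
print(member.get("xiandianA", "info:age"))
```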

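12. In relational database systems, a namespace is a logical grouping of tables, and tables in the same group serve similar purposes. Log in to the HBase database, create a namespace named newspace and check it with list, then create the table member in this namespace with the column families 'address' and 'info'. Then insert the following data into the table: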

'xiandianA','info:age','24'

'xiandianA','info:birthday','1990-07-17'

'xiandianA','info:company','alibaba'

'xiandianA','address:contry','china'

'xiandianA','address:province','zhejiang'

'xiandianA','address:city','hangzhou'

After the inserts, use the scan command to query only info:age, with startrow set to xiandianA.

Answer:

hbase(main):022:0> create_namespace 'newspace'

0 row(s) in 0.1130 seconds

hbase(main):023:0> create 'newspace:member','address','info'

0 row(s) in 1.5270 seconds

hbase(main):024:0> list

TABLE

emp

member

newspace:member

3 row(s) in 0.0100 seconds

=> ["emp", "member", "newspace:member"]

hbase(main):033:0> put 'newspace:member','xiandianA','info:age','24'

0 row(s) in 0.0620 seconds

hbase(main):037:0> put 'newspace:member','xiandianA','info:birthday','1990-07-17'

0 row(s) in 0.0110 seconds

hbase(main):038:0> put 'newspace:member','xiandianA','info:company','alibaba'

0 row(s) in 0.0130 seconds

hbase(main):039:0> put 'newspace:member','xiandianA','address:contry','china'

0 row(s) in 0.0070 seconds

hbase(main):040:0> put 'newspace:member','xiandianA','address:province','zhejiang'

0 row(s) in 0.0070 seconds

hbase(main):041:0> put 'newspace:member','xiandianA','address:city','hangzhou'

0 row(s) in 0.0070 seconds

hbase(main):044:0> scan 'newspace:member', {COLUMNS => ['info:age'],STARTROW => 'xiandianA'}

ROW                                              COLUMN+CELL

xiandianA                                       column=info:age, timestamp=1522214952401, value=24

1 row(s) in 0.0160 seconds

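13. Log in to the master node and create a local file named hbasetest.txt. Its contents must create a table 'test' with the column family 'cf' and then batch-insert the following data into it: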

'row1', 'cf:a', 'value1'

'row2', 'cf:b', 'value2'

'row3', 'cf:c', 'value3'

'row4', 'cf:d', 'value4'

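After the inserts, query the table contents with scan, then query only row1 with get, and finally exit the HBase shell. Run hbasetest.txt with a command, and submit both the contents of hbasetest.txt and the output returned by the run.

Answer: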

[root@exam1 ~]# cat hbasetest.txt

create 'test', 'cf'

list 'test'

put 'test', 'row1', 'cf:a', 'value1'

put 'test', 'row2', 'cf:b', 'value2'

put 'test', 'row3', 'cf:c', 'value3'

put 'test', 'row4', 'cf:d', 'value4'

scan 'test'

get 'test', 'row1'

exit

[root@exam1 ~]# hbase shell hbasetest.txt

0 row(s) in 1.5010 seconds

TABLE

test

1 row(s) in 0.0120 seconds

0 row(s) in 0.1380 seconds

0 row(s) in 0.0090 seconds

0 row(s) in 0.0050 seconds

0 row(s) in 0.0050 seconds

ROW                     COLUMN+CELL

row1                   column=cf:a, timestamp=1522314428726, value=value1

row2                   column=cf:b, timestamp=1522314428746, value=value2

row3                   column=cf:c, timestamp=1522314428752, value=value3

row4                   column=cf:d, timestamp=1522314428758, value=value4

4 row(s) in 0.0350 seconds

COLUMN                  CELL

cf:a                   timestamp=1522314428726, value=value1

1 row(s) in 0.0190 seconds

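14. Use Hive to create the table xd_phy_course as an external table stored at /1daoyun/data/hive, and load phy_course_xd.txt into it. The structure of xd_phy_course is shown in the table below. After loading, show the structure of xd_phy_course in Hive. (Write all database commands in lowercase.)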

stname(string)

stID(int)

class(string)

opt_cour(string)

Answer:

hive> create external table xd_phy_course (stname string,stID int,class string,opt_cour string) row format delimited fields terminated by '\t' lines terminated by '\n' location '/1daoyun/data/hive';

OK

Time taken: 1.197 seconds

hive> load data local inpath '/root/phy_course_xd.txt' into table xd_phy_course;

Loading data to table default.xd_phy_course

Table default.xd_phy_course stats: [numFiles=1, totalSize=89444]

OK

Time taken: 0.96 seconds

hive> desc xd_phy_course;

OK

stname                  string

stid                    int

class                   string

opt_cour                string

Time taken: 0.588 seconds, Fetched: 4 row(s)

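15. Use Hive to count the total number of students at a university enrolled in each elective physical education course in phy_course_xd.txt. The structure of phy_course_xd.txt is shown in the table below; the elective course field is opt_cour. Load the result into the table phy_opt_count and query its contents with a SELECT statement. (Write all database commands in lowercase.)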

stname(string)

stID(int)

class(string)

opt_cour(string)

Answer:

hive> create table xd_phy_course (stname string,stID int,class string,opt_cour string) row format delimited fields terminated by '\t' lines terminated by '\n';

OK

Time taken: 4.067 seconds

hive> load data local inpath '/root/phy_course_xd.txt' into table xd_phy_course;

Loading data to table default.xd_phy_course

Table default.xd_phy_course stats: [numFiles=1, totalSize=89444]

OK

Time taken: 1.422 seconds

hive> create table phy_opt_count (opt_cour string,cour_count int) row format delimited fields terminated by '\t' lines terminated by '\n';

OK

Time taken: 1.625 seconds

hive> insert overwrite table phy_opt_count select xd_phy_course.opt_cour,count(distinct xd_phy_course.stID) from xd_phy_course group by xd_phy_course.opt_cour;

Query ID = root_20170507125642_6af22d21-ae88-4daf-a346-4b1cbcd7d9fe

Total jobs = 1

Launching Job 1 out of 1

Tez session was closed. Reopening...

Session re-established.

Status: Running (Executing on YARN cluster with App id application_1494149668396_0004)

--------------------------------------------------------------------------------

VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

--------------------------------------------------------------------------------

Map 1 ..........   SUCCEEDED      1          1        0        0       0       0

Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0

--------------------------------------------------------------------------------

VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 4.51 s

--------------------------------------------------------------------------------

Loading data to table default.phy_opt_count

Table default.phy_opt_count stats: [numFiles=1, numRows=10, totalSize=138, rawDataSize=128]

OK

Time taken: 13.634 seconds

hive> select * from phy_opt_count;

OK

badminton       234

basketball      224

football        206

gymnastics      220

opt_cour        0

swimming        234

table tennis    277

taekwondo       222

tennis  223

volleyball      209

Time taken: 0.065 seconds, Fetched: 10 row(s)
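The row `opt_cour 0` in the result appears to come from a header line in the data file being loaded as an ordinary row: its stID text cannot be read as an integer, so the value is NULL, and count(distinct ...) ignores NULLs. The sketch below reproduces that effect on hypothetical sample rows; the helper names are introduced here for illustration only.

```python
from collections import defaultdict

def to_int(text):
    try:
        return int(text)
    except ValueError:
        return None  # stands in for a NULL produced by a failed int parse

def count_distinct_by_course(lines):
    """Group tab-separated rows by opt_cour and count distinct non-NULL stID."""
    ids = defaultdict(set)
    for line in lines:
        stname, stid, cls, opt_cour = line.rstrip("\n").split("\t")
        ids[opt_cour]  # create the group even when stID is NULL
        value = to_int(stid)
        if value is not None:
            ids[opt_cour].add(value)
    return {course: len(s) for course, s in sorted(ids.items())}

# Hypothetical rows; the first one mimics a header line.
sample = [
    "stname\tstID\tclass\topt_cour",
    "A\t1\tNetwork_1401\ttennis",
    "B\t2\tNetwork_1401\ttennis",
    "B\t2\tNetwork_1401\ttennis",
    "C\t3\tSoftware_1403\tswimming",
]
print(count_distinct_by_course(sample))
```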

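16. Use Hive to compute the average physical education score of each class at a university from phy_course_score_xd.txt, using the round function to keep two decimal places. The structure of phy_course_score_xd.txt is shown in the table below; the class field is class and the score field is score. (Write all database commands in lowercase.)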

stname(string)

stID(int)

class(string)

opt_cour(string)

score(float)

Answer:

hive> create table phy_course_score_xd (stname string,stID int,class string,opt_cour string,score float) row format delimited fields terminated by '\t' lines terminated by '\n';

OK

Time taken: 0.339 seconds

hive> load data local inpath '/root/phy_course_score_xd.txt' into table phy_course_score_xd;

Loading data to table default.phy_course_score_xd

Table default.phy_course_score_xd stats: [numFiles=1, totalSize=1910]

OK

Time taken: 1.061 seconds

hive> select class,round(avg(score)) from phy_course_score_xd group by class;

Query ID = root_20170507131823_0bfb1faf-3bfb-42a5-b7eb-3a6a284081ae

Total jobs = 1

Launching Job 1 out of 1

Status: Running (Executing on YARN cluster with App id application_1494149668396_0005)

--------------------------------------------------------------------------------

VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

--------------------------------------------------------------------------------

Map 1 ..........   SUCCEEDED      1          1        0        0       0       0

Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0

--------------------------------------------------------------------------------

VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 26.68 s

--------------------------------------------------------------------------------

OK

Network_1401    73.0

Software_1403   72.0

class   NULL

Time taken: 27.553 seconds, Fetched: 3 row(s)
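The query in the transcript calls round with a single argument, which rounds to a whole number and explains the values 73.0 and 72.0. The task asks for two decimal places, which in Hive would be written round(avg(score), 2). The sketch below shows the difference on hypothetical scores; note that Python and Hive may treat exact ties differently, so the sample avoids them.

```python
from collections import defaultdict

def class_averages(rows, digits):
    """Average score per class, rounded to the given number of decimals."""
    scores = defaultdict(list)
    for cls, score in rows:
        scores[cls].append(score)
    return {cls: round(sum(v) / len(v), digits) for cls, v in scores.items()}

# Hypothetical scores, not taken from phy_course_score_xd.txt.
sample = [
    ("Network_1401", 70.0),
    ("Network_1401", 75.0),
    ("Network_1401", 76.0),
    ("Software_1403", 72.25),
]
print(class_averages(sample, 0))  # whole numbers, like round(avg(score))
print(class_averages(sample, 2))  # two decimals, like round(avg(score), 2)
```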

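17. Use Hive to find the highest physical education score of each class at a university from phy_course_score_xd.txt. The structure of phy_course_score_xd.txt is shown in the table below; the class field is class and the score field is score. (Write all database commands in lowercase.)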

stname(string)

stID(int)

class(string)

opt_cour(string)

score(float)

Answer:

hive> create table phy_course_score_xd (stname string,stID int,class string,opt_cour string,score float) row format delimited fields terminated by '\t' lines terminated by '\n';

OK

Time taken: 0.339 seconds

hive> load data local inpath '/root/phy_course_score_xd.txt' into table phy_course_score_xd;

Loading data to table default.phy_course_score_xd

Table default.phy_course_score_xd stats: [numFiles=1, totalSize=1910]

OK

Time taken: 1.061 seconds

hive> select class,max(score) from phy_course_score_xd group by class;

Query ID = root_20170507131942_86a2bf55-49ac-4c2e-b18b-8f63191ce349

Total jobs = 1

Launching Job 1 out of 1

Status: Running (Executing on YARN cluster with App id application_1494149668396_0005)

--------------------------------------------------------------------------------

VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

--------------------------------------------------------------------------------

Map 1 ..........   SUCCEEDED      1          1        0        0       0       0

Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0

--------------------------------------------------------------------------------

VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 5.08 s

--------------------------------------------------------------------------------

OK

Network_1401    95.0

Software_1403   100.0

class   NULL

Time taken: 144.035 seconds, Fetched: 3 row(s)

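18. In the Hive data warehouse, merge the separate request_date and request_time fields of the web log weblog_entries.txt, joined by a single underscore "_", as shown in the figure below. The structure of weblog_entries.txt is shown in the table below. (Write all database commands in lowercase.)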

md5(STRING)

url(STRING)

request_date (STRING)

request_time (STRING)

ip(STRING)

Answer:

hive> create table weblog_entries (md5 string,url string,request_date string,request_time string,ip string) row format delimited fields terminated by '\t' lines terminated by '\n';

OK

Time taken: 0.502 seconds

hive> load data local inpath '/root/weblog_entries.txt' into table weblog_entries;

Loading data to table default.weblog_entries

Table default.weblog_entries stats: [numFiles=1, totalSize=251130]

OK

Time taken: 1.203 seconds

hive> select concat_ws('_', request_date, request_time) from weblog_entries;

2012-05-10_21:29:01

2012-05-10_21:13:47

2012-05-10_21:12:37

2012-05-10_21:34:20

2012-05-10_21:27:00

2012-05-10_21:33:53

2012-05-10_21:10:19

2012-05-10_21:12:05

2012-05-10_21:25:58

2012-05-10_21:34:28

Time taken: 0.265 seconds, Fetched: 3000 row(s)
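For non-NULL string arguments, concat_ws joins its arguments with the separator given first. A minimal Python equivalent, using the first row of the output above, might look like this; the function is a stand-in written for illustration and does not model NULL handling.

```python
def concat_ws(sep, *parts):
    """Join the given strings with sep (NULL handling is not modelled)."""
    return sep.join(parts)

print(concat_ws("_", "2012-05-10", "21:29:01"))
```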

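19. Use Hive to create a table dynamically from the result of a query on the web log weblog_entries.txt. Create a new table named weblog_entries_url_length that defines three fields of the new web log table: url, request_date, and request_time. In addition, define a new field named "url_length" that holds the length of the url string. The structure of weblog_entries.txt is shown in the table below. Afterwards, query the contents of the weblog_entries_url_length table. (Write all database commands in lowercase.)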

md5(STRING)

url(STRING)

request_date (STRING)

request_time (STRING)

ip(STRING)

Answer:

hive> create table weblog_entries (md5 string,url string,request_date string,request_time string,ip string) row format delimited fields terminated by '\t' lines terminated by '\n';

OK

Time taken: 0.502 seconds

hive> load data local inpath '/root/weblog_entries.txt' into table weblog_entries;

Loading data to table default.weblog_entries

Table default.weblog_entries stats: [numFiles=1, totalSize=251130]

OK

Time taken: 1.203 seconds

hive> create table weblog_entries_url_length as select url, request_date, request_time, length(url) as url_length from weblog_entries;

Query ID = root_20170507065123_e3105d8b-84b6-417f-ab58-21ea15723e0a

Total jobs = 1

Launching Job 1 out of 1

Status: Running (Executing on YARN cluster with App id application_1494136863427_0002)

--------------------------------------------------------------------------------

VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

--------------------------------------------------------------------------------

Map 1 ..........   SUCCEEDED      1          1        0        0       0       0

--------------------------------------------------------------------------------

VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 4.10 s

--------------------------------------------------------------------------------

Moving data to: hdfs://master:8020/apps/hive/warehouse/weblog_entries_url_length

Table default.weblog_entries_url_length stats: [numFiles=1, numRows=3000, totalSize=121379, rawDataSize=118379]

OK

Time taken: 5.874 seconds

hive> select * from weblog_entries_url_length;

/qnrxlxqacgiudbtfggcg.html      2012-05-10      21:29:01        26

/sbbiuot.html   2012-05-10      21:13:47        13

/ofxi.html      2012-05-10      21:12:37        10

/hjmdhaoogwqhp.html     2012-05-10      21:34:20        19

/angjbmea.html  2012-05-10      21:27:00        14

/mmdttqsnjfifkihcvqu.html       2012-05-10      21:33:53        25

/eorxuryjadhkiwsf.html  2012-05-10      21:10:19        22

/e.html 2012-05-10      21:12:05        7

/khvc.html      2012-05-10      21:25:58        10

/c.html 2012-05-10      21:34:28        7

Time taken: 0.08 seconds, Fetched: 3000 row(s)
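The create table ... as select statement keeps three source columns and derives a fourth from the url. The sketch below mirrors that projection on two rows from the output above; the md5 and ip values are placeholders, since those columns are not shown in the transcript.

```python
def with_url_length(rows):
    """Project (md5, url, date, time, ip) rows to (url, date, time, url_length)."""
    return [(url, date, time, len(url)) for (_md5, url, date, time, _ip) in rows]

sample = [
    ("<md5>", "/qnrxlxqacgiudbtfggcg.html", "2012-05-10", "21:29:01", "<ip>"),
    ("<md5>", "/e.html", "2012-05-10", "21:12:05", "<ip>"),
]
for row in with_url_length(sample):
    print("\t".join(str(field) for field in row))
```

The lengths 26 and 7 match the url_length column in the query result.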

20. Install the Sqoop clients on the master and slaver nodes, then check the Sqoop version on the master node.

Answer:

[root@master ~]# sqoop version

Warning: /usr/hdp/2.4.3.0-227/accumulo does not exist! Accumulo imports will fail.

Please set $ACCUMULO_HOME to the root of your Accumulo installation.

17/05/07 06:56:25 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.3.0-227

Sqoop 1.4.6.2.4.3.0-227

git commit id d296ad374bd38a1c594ef0f5a2d565d71e798aa6

Compiled by jenkins on Sat Sep 10 00:58:52 UTC 2016

21. Use Sqoop to list all tables in the ambari database in MySQL on the master node.

Answer:

[root@master ~]# sqoop list-tables --connect jdbc:mysql://localhost/ambari --username root --password bigdata

Warning: /usr/hdp/2.4.3.0-227/accumulo does not exist! Accumulo imports will fail.

Please set $ACCUMULO_HOME to the root of your Accumulo installation.

17/05/07 07:07:01 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.3.0-227

17/05/07 07:07:01 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

17/05/07 07:07:02 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.

ClusterHostMapping

QRTZ_BLOB_TRIGGERS

QRTZ_CALENDARS

QRTZ_CRON_TRIGGERS

QRTZ_FIRED_TRIGGERS

QRTZ_JOB_DETAILS

QRTZ_LOCKS

QRTZ_PAUSED_TRIGGER_GRPS

QRTZ_SCHEDULER_STATE

QRTZ_SIMPLE_TRIGGERS

QRTZ_SIMPROP_TRIGGERS

QRTZ_TRIGGERS

adminpermission

adminprincipal

adminprincipaltype

adminprivilege

adminresource

adminresourcetype

alert_current

alert_definition

alert_group

alert_group_target

alert_grouping

alert_history

alert_notice

alert_target

alert_target_states

ambari_sequences

artifact

blueprint

blueprint_configuration

clusterEvent

cluster_version

clusterconfig

clusterconfigmapping

clusters

clusterservices

clusterstate

confgroupclusterconfigmapping

configgroup

configgrouphostmapping

execution_command

groups

hdfsEvent

host_role_command

host_version

hostcomponentdesiredstate

hostcomponentstate

hostconfigmapping

hostgroup

hostgroup_component

hostgroup_configuration

hosts

hoststate

job

kerberos_descriptor

kerberos_principal

kerberos_principal_host

key_value_store

mapreduceEvent

members

metainfo

repo_version

request

requestoperationlevel

requestresourcefilter

requestschedule

requestschedulebatchrequest

role_success_criteria

servicecomponentdesiredstate

serviceconfig

serviceconfighosts

serviceconfigmapping

servicedesiredstate

stack

stage

task

taskAttempt

topology_host_info

topology_host_request

topology_host_task

topology_hostgroup

topology_logical_request

topology_logical_task

topology_request

upgrade

upgrade_group

upgrade_item

users

viewentity

viewinstance

viewinstancedata

viewinstanceproperty

viewmain

viewparameter

viewresource

widget

widget_layout

widget_layout_user_widget

workflow

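22. In MySQL, create a database named xiandian and, in it, a table xd_phy_course whose structure is shown in Table 1. Use Hive to create the table xd_phy_course and load phy_course_xd.txt into it; the structure of this table is shown in Table 2. Use Sqoop to export the xd_phy_course table from the Hive data warehouse to the xd_phy_course table of the xiandian database in MySQL on the master node.

Table 1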

stname VARCHAR(20)

stID INT(1)

class VARCHAR(20)

opt_cour VARCHAR(20)

Table 2

stname(string)

stID(int)

class(string)

opt_cour(string)

Answer:

[root@master ~]# mysql -uroot -pbigdata

Welcome to the MariaDB monitor.  Commands end with ; or \g.

Your MariaDB connection id is 37

Server version: 5.5.44-MariaDB MariaDB Server

Copyright (c) 2000, 2015, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> create database xiandian;

Query OK, 1 row affected (0.00 sec)

MariaDB [(none)]> use xiandian;

Database changed

MariaDB [xiandian]> create table xd_phy_course(stname varchar(20),stID int(1),class varchar(20),opt_cour varchar(20));

Query OK, 0 rows affected (0.20 sec)

hive> create table xd_phy_course (stname string,stID int,class string,opt_cour string) row format delimited fields terminated by '\t' lines terminated by '\n';

OK

Time taken: 3.136 seconds

hive> load data local inpath '/root/phy_course_xd.txt' into table xd_phy_course;

Loading data to table default.xd_phy_course

Table default.xd_phy_course stats: [numFiles=1, totalSize=89444]

OK

Time taken: 1.129 seconds

[root@master ~]# sqoop export --connect jdbc:mysql://localhost:3306/xiandian --username root --password bigdata --table xd_phy_course  --hcatalog-table xd_phy_course

Warning: /usr/hdp/2.4.3.0-227/accumulo does not exist! Accumulo imports will fail.

Please set $ACCUMULO_HOME to the root of your Accumulo installation.

17/05/07 07:29:48 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.3.0-227

17/05/07 07:29:48 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

17/05/07 07:29:48 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.

17/05/07 07:29:48 INFO tool.CodeGenTool: Beginning code generation

17/05/07 07:29:48 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `xd_phy_course` AS t LIMIT 1

17/05/07 07:29:48 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `xd_phy_course` AS t LIMIT 1

17/05/07 07:29:48 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.4.3.0-227/hadoop-mapreduce

Note: /tmp/sqoop-root/compile/35d4b31b4d93274ba6bde54b3e56a821/xd_phy_course.java uses or overrides a deprecated API.

Note: Recompile with -Xlint:deprecation for details.

17/05/07 07:29:50 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/35d4b31b4d93274ba6bde54b3e56a821/xd_phy_course.jar

17/05/07 07:29:50 INFO mapreduce.ExportJobBase: Beginning export of xd_phy_course

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.3.0-227/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.3.0-227/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

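23. Use Pig in local mode to count the hits per IP in the system log access-log.txt. Use a GROUP BY statement to group by IP, iterate over the columns of the relation with the FOREACH operator to count the rows in each group, and finally use a DUMP statement to show the result.

Answer: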

grunt> copyFromLocal /root/Pig/access-log.txt /user/root/input/log1.txt

grunt> A = LOAD '/user/root/input/log1.txt' USING PigStorage(' ') AS (ip,others);

grunt> group_ip = group A by ip;

grunt> result = foreach group_ip generate group,COUNT(A);

grunt> dump result;

2018-02-13 08:13:36,520 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: GROUP_BY

HadoopVersion PigVersion UserId StartedAt FinishedAt Features

2.7.3.2.6.1.0-129 0.16.0.2.6.1.0-129 root 2018-02-13 08:13:37 2018-02-13 08:13:41 GROUP_BY

Success!

Job Stats (time in seconds):

JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs

job_local963723433_0001 1 1 n/a n/a n/a n/a n/a n/a n/a n/a A,group_ip,result GROUP_BY,COMBINER file:/tmp/temp-1479363025/tmp133834330,

Input(s):

Successfully read 62991 records from: "/user/root/input/log1.txt"

Output(s):

Successfully stored 182 records in: "file:/tmp/temp-1479363025/tmp133834330"

(220.181.108.186,1)

(222.171.234.225,142)

(http://www.1daoyun.com/course/toregeister",1)
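The grouping step treats whatever precedes the first space on each line as the key, so a line that does not start with an IP address forms a group of its own, which is the likely origin of the last tuple above. A minimal sketch of the same count follows; the log lines are hypothetical.

```python
from collections import Counter

def hits_per_key(lines):
    """Count lines by their first space-delimited field."""
    return Counter(line.split(" ", 1)[0] for line in lines if line.strip())

# Hypothetical log lines; the last one does not start with an IP address.
sample = [
    "222.171.234.225 - - GET /a",
    "222.171.234.225 - - GET /b",
    "220.181.108.186 - - GET /a",
    'http://www.example.invalid/page" continued text',
]
for key, count in sorted(hits_per_key(sample).items()):
    print((key, count))
```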

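24. Use Pig to compute the highest temperature of each year in the weather data set temperature.txt. Use a GROUP BY statement to group by year, iterate over the columns of the relation with the FOREACH operator to find the maximum of each group, and finally use a DUMP statement to show the result.

Answer: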

grunt> copyFromLocal /root/Pig/temperature.txt /user/root/temp.txt

grunt> A = LOAD '/user/root/temp.txt' USING PigStorage(' ') AS (year:int,temperature:int);

grunt> B = GROUP A BY year;

grunt> C = FOREACH B GENERATE group,MAX(A.temperature);

grunt> dump C;

2018-02-13 08:18:52,107 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: GROUP_BY

(2012,40)

(2013,36)

(2014,37)

(2015,39)
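The GROUP and MAX steps amount to keeping the largest temperature seen for each year. A short sketch of that logic, on hypothetical readings rather than the contents of temperature.txt:

```python
def max_by_year(records):
    """Return [(year, max_temperature)] sorted by year."""
    best = {}
    for year, temperature in records:
        if year not in best or temperature > best[year]:
            best[year] = temperature
    return sorted(best.items())

# Hypothetical (year, temperature) readings.
sample = [(2012, 35), (2012, 40), (2013, 36), (2013, 30)]
print(max_by_year(sample))
```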

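25. Use Pig to count the number of IP addresses per country in the data set ip_to_country. Use a GROUP BY statement to group by country, iterate over the columns of the relation with the FOREACH operator to count the IP addresses in each group, then save the result to the /data/pig/output directory and view the data.

Answer: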

grunt> copyFromLocal /root/Pig/ip_to_country.txt /user/root/ip_to_country.txt

grunt> ip_countries = LOAD '/user/root/ip_to_country.txt' AS (ip: chararray, country:chararray);

grunt> country_grpd = GROUP ip_countries BY country;

grunt> country_counts = FOREACH country_grpd GENERATE FLATTEN(group),COUNT(ip_countries) as counts;

grunt> STORE country_counts INTO '/data/pig/output';

2018-02-13 08:23:35,621 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: GROUP_BY

Moldova, Republic of 1

Syrian Arab Republic 1

United Arab Emirates 2

Bosnia and Herzegovina 1

Iran, Islamic Republic of 2

Tanzania, United Republic of 1

26. Install the Mahout client on the master node, then open a Linux shell and run the mahout command to list the example programs that ship with Mahout.

Answer:

[root@master ~]# mahout

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.

Running on hadoop, using /usr/hdp/2.6.1.0-129/hadoop/bin/hadoop and HADOOP_CONF_DIR=/usr/hdp/2.6.1.0-129/hadoop/conf

MAHOUT-JOB: /usr/hdp/2.6.1.0-129/mahout/mahout-examples-0.9.0.2.6.1.0-129-job.jar

An example program must be given as the first argument.

Valid program names are:

arff.vector: : Generate Vectors from an ARFF file or directory

baumwelch: : Baum-Welch algorithm for unsupervised HMM training

buildforest: : Build the random forest classifier

canopy: : Canopy clustering

cat: : Print a file or resource as the logistic regression models would see it

cleansvd: : Cleanup and verification of SVD output

clusterdump: : Dump cluster output to text

clusterpp: : Groups Clustering Output In Clusters

cmdump: : Dump confusion matrix in HTML or text formats

concatmatrices: : Concatenates 2 matrices of same cardinality into a single matrix

cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)

cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.

describe: : Describe the fields and target variable in a data set

evaluateFactorization: : compute RMSE and MAE of a rating matrix factorization against probes

fkmeans: : Fuzzy K-means clustering

hmmpredict: : Generate random sequence of observations by given HMM

itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering

kmeans: : K-means clustering

lucene.vector: : Generate Vectors from a Lucene index

lucene2seq: : Generate Text SequenceFiles from a Lucene index

matrixdump: : Dump matrix in CSV format

matrixmult: : Take the product of two matrices

parallelALS: : ALS-WR factorization of a rating matrix

qualcluster: : Runs clustering experiments and summarizes results in a CSV

recommendfactorized: : Compute recommendations using the factorization of a rating matrix

recommenditembased: : Compute recommendations using item-based collaborative filtering

regexconverter: : Convert text files on a per line basis based on regular expressions

resplit: : Splits a set of SequenceFiles into a number of equal splits

rowid: : Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}

rowsimilarity: : Compute the pairwise similarities of the rows of a matrix

runAdaptiveLogistic: : Score new production data using a probably trained and validated AdaptivelogisticRegression model

runlogistic: : Run a logistic regression model against CSV data

seq2encoded: : Encoded Sparse Vector generation from Text sequence files

seq2sparse: : Sparse Vector generation from Text sequence files

seqdirectory: : Generate sequence files (of Text) from a directory

seqdumper: : Generic Sequence File dumper

seqmailarchives: : Creates SequenceFile from a directory containing gzipped mail archives

seqwiki: : Wikipedia xml dump to sequence file

spectralkmeans: : Spectral k-means clustering

split: : Split Input data into test and train sets

splitDataset: : split a rating dataset into training and probe parts

ssvd: : Stochastic SVD

streamingkmeans: : Streaming k-means clustering

svd: : Lanczos Singular Value Decomposition

testforest: : Test the random forest classifier

testnb: : Test the Vector-based Bayes classifier

trainAdaptiveLogistic: : Train an AdaptivelogisticRegression model

trainlogistic: : Train a logistic regression using stochastic gradient descent

trainnb: : Train the Vector-based Bayes classifier

transpose: : Take the transpose of a matrix

validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set

vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors

vectordump: : Dump vectors from a sequence file to text

viterbi: : Viterbi decoding of hidden states from given output states sequence

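27. Use the Mahout mining tool to generate item recommendations from the dataset user-item-score.txt (user, item, score). Use the item-based collaborative filtering algorithm with Euclidean distance as the similarity measure, recommend 3 items per user, treat the data as non-Boolean, set the maximum preference value to 4 and the minimum preference value to 1, save the recommendation output to the output directory, and use the -cat command to view the contents of the output file part-r-00000.

Answer: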

[hdfs@master ~]$ hadoop fs -mkdir -p /data/mahout/project

[hdfs@master ~]$ hadoop fs -put user-item-score.txt /data/mahout/project

[hdfs@master ~]$ mahout recommenditembased -i /data/mahout/project/user-item-score.txt -o /data/mahout/project/output -n 3 -b false -s SIMILARITY_EUCLIDEAN_DISTANCE --maxPrefsPerUser 4 --minPrefsPerUser 1 --maxPrefsInItemSimilarity 4 --tempDir /data/mahout/project/temp

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.

Running on hadoop, using /usr/hdp/2.4.3.0-227/hadoop/bin/hadoop and

17/05/15 19:37:25 INFO driver.MahoutDriver: Program took 259068 ms (Minutes: 4.3178)

[hdfs@master ~]$ hadoop fs -cat /data/mahout/project/output/part-r-00000

1       [105:3.5941463,104:3.4639049]

2       [106:3.5,105:2.714964,107:2.0]

3       [103:3.59246,102:3.458911]

4       [107:4.7381864,105:4.2794304,102:4.170158]

5       [103:3.8962872,102:3.8564017,107:3.7692602]
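Each line of part-r-00000 above has the form: user ID, whitespace, then a bracketed list of itemID:score pairs. As a minimal sketch, assuming that layout holds for every line, the hypothetical helper `parseRecommendation` below (not part of Mahout) turns one such line into Scala values. It can be pasted into any Scala REPL.

```scala
// Hypothetical helper (not a Mahout API): parses one output line such as
// "1       [105:3.5941463,104:3.4639049]" into (userId, List((itemId, score))).
def parseRecommendation(line: String): (Int, List[(Int, Double)]) = {
  // Split into the user ID and the bracketed item list.
  val Array(user, items) = line.trim.split("\\s+", 2)
  val pairs = items
    .stripPrefix("[")
    .stripSuffix("]")
    .split(",")
    .toList
    .map { pair =>
      // Each pair looks like "105:3.5941463".
      val Array(item, score) = pair.split(":")
      (item.toInt, score.toDouble)
    }
  (user.toInt, pairs)
}

val (user, recs) = parseRecommendation("1       [105:3.5941463,104:3.4639049]")
println(s"user $user -> $recs")
```

The list keeps the order in which the pairs appear in the file, which in the output above runs from the highest score to the lowest.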

28. Install and start the Flume component on the master node, open a Linux shell, run the flume-ng help command, and view the flume-ng usage information.

Answer:

[root@master ~]# flume-ng help

Usage: /usr/hdp/2.6.1.0-129/flume/bin/flume-ng.distro <command> [options]...

commands:

help                  display this help text

agent                 run a Flume agent

avro-client           run an avro Flume client

password              create a password file for use in flume config

version               show Flume version info

global options:

--conf,-c <conf>      use configs in <conf> directory

--classpath,-C <cp>   append to the classpath

--dryrun,-d           do not actually start Flume, just print the command

--plugins-path <dirs> colon-separated list of plugins.d directories. See the

plugins.d section in the user guide for more details.

Default: $FLUME_HOME/plugins.d

-Dproperty=value      sets a Java system property value

-Xproperty=value      sets a Java -X option

agent options:

--conf-file,-f <file> specify a config file (required)

--name,-n <name>      the name of this agent (required)

--help,-h             display help text

avro-client options:

--rpcProps,-P <file>   RPC client properties file with server connection params

--host,-H <host>       hostname to which events will be sent

--port,-p <port>       port of the avro source

--dirname <dir>        directory to stream to avro source

--filename,-F <file>   text file to stream to avro source (default: std input)

--headerFile,-R <file> File containing event headers as key/value pairs on each new line

--help,-h              display help text

Either --rpcProps or both --host and --port must be specified.

password options:

--outfile              The file in which encoded password is stored

Note that if <conf> directory is specified, then it is always included first

in the classpath.

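29. Based on the provided template file hdfs-example.conf, use Flume NG to set the system path /opt/xiandian/ on the master node as the directory whose files are uploaded to HDFS in real time, set the HDFS storage path to /data/flume/, keep the uploaded file names unchanged, use the file type DataStream, and then start the flume-ng agent.

Answer:

[root@master ~]# flume-ng agent --conf-file hdfs-example.conf --name master -Dflume.root.logger=INFO,console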

Warning: No configuration directory set! Use --conf <dir> to override.

Info: Including Hadoop libraries found via (/bin/hadoop) for HDFS access

Info: Excluding /usr/hdp/2.4.3.0-227/hadoop/lib/slf4j-api-1.7.10.jar from classpath

Info: Excluding /usr/hdp/2.4.3.0-227/hadoop/lib/slf4j-log4j12-1.7.10.jar from classpath

Info: Excluding /usr/hdp/2.4.3.0-227/tez/lib/slf4j-api-1.7.5.jar from classpath

Info: Including HBASE libraries found via (/bin/hbase) for HBASE access

Info: Excluding /usr/hdp/2.4.3.0-227/hbase/lib/slf4j-api-1.7.7.jar from classpath

Info: Excluding /usr/hdp/2.4.3.0-227/hadoop/lib/slf4j-api-1.7.10.jar from classpath

Info: Excluding /usr/hdp/2.4.3.0-227/hadoop/lib/slf4j-log4j12-1.7.10.jar from classpath

Info: Excluding /usr/hdp/2.4.3.0-227/tez/lib/slf4j-api-1.7.5.jar from classpath

Info: Excluding /usr/hdp/2.4.3.0-227/hadoop/lib/slf4j-api-1.7.10.jar from classpath

Info: Excluding /usr/hdp/2.4.3.0-227/hadoop/lib/slf4j-log4j12-1.7.10.jar from classpath

Info: Excluding /usr/hdp/2.4.3.0-227/zookeeper/lib/slf4j-api-1.6.1.jar from classpath

Info: Excluding /usr/hdp/2.4.3.0-227/zookeeper/lib/slf4j-log4j12-1.6.1.jar from classpath

Info: Including Hive libraries found via () for Hive access

[root@master ~]# cat hdfs-example.conf

# example.conf: A single-node Flume configuration

# Name the components on this agent

master.sources = webmagic

master.sinks = k1

master.channels = c1

# Describe/configure the source

master.sources.webmagic.type = spooldir

master.sources.webmagic.fileHeader = true

master.sources.webmagic.fileHeaderKey = fileName

master.sources.webmagic.fileSuffix = .COMPLETED

master.sources.webmagic.deletePolicy = never

master.sources.webmagic.spoolDir = /opt/xiandian/

master.sources.webmagic.ignorePattern = ^$

master.sources.webmagic.consumeOrder = oldest

master.sources.webmagic.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder

master.sources.webmagic.batchSize = 5

master.sources.webmagic.channels = c1

# Use a channel which buffers events in memory

master.channels.c1.type = memory

# Describe the sink

master.sinks.k1.type = hdfs

master.sinks.k1.channel = c1

master.sinks.k1.hdfs.path = hdfs://master:8020/data/flume/%{dicName}

master.sinks.k1.hdfs.filePrefix = %{fileName}

master.sinks.k1.hdfs.fileType = DataStream

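30. Deploy the Spark service component on the Xiandian big data platform, open a Linux shell, start the spark-shell terminal, and submit the startup process information.

Answer: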

[root@master ~]# spark-shell

Setting default log level to "WARN".

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

Spark context Web UI available at http://172.24.2.110:4040

Spark context available as 'sc' (master = local[*], app id = local-1519375873795).

Spark session available as 'spark'.

Welcome to

      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.1.2.6.1.0-129
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77)

Type in expressions to have them evaluated.

Type :help for more information.

scala>

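31. Log in to spark-shell, define i as 1 and sum as 0, use a while loop to compute the sum of the integers from 1 to 100, and print sum with Scala's standard output function.

Answer: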

scala> var i=1

i: Int = 1

scala> var sum=0

sum: Int = 0

scala> while(i<=100){

| sum+=i

| i=i+1

| }

scala> println(sum)

5050
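The exercise asks for a while loop, which the answer above uses. For comparison, the same total can be computed without mutable variables. The sketch below uses only the Scala standard library, and the names `viaRange`, `viaFold`, and `viaFormula` are illustrative.

```scala
// Sum of 1..100 using the built-in sum on a Range.
val viaRange = (1 to 100).sum

// The same sum as an explicit left fold starting from 0.
val viaFold = (1 to 100).foldLeft(0)(_ + _)

// Closed form n * (n + 1) / 2 for n = 100.
val n = 100
val viaFormula = n * (n + 1) / 2

println(viaRange)
```

All three values equal 5050, matching the result of the loop.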

32. Log in to spark-shell, define a list (1,2,3,4,5,6,7,8,9), and use the map function to multiply each element of the list by 2.

Answer:

scala> import scala.math._

import scala.math._

scala> val nums=List(1,2,3,4,5,6,7,8,9)

nums: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9)

scala> nums.map(x=>x*2)

res18: List[Int] = List(2, 4, 6, 8, 10, 12, 14, 16, 18)

33. Log in to spark-shell, define a list ("Hadoop","Java","Spark"), and use the flatMap function to split the list into individual letters and convert them to uppercase.

Answer:

scala> val data = List("Hadoop","Java","Spark")

data: List[String] = List(Hadoop, Java, Spark)

scala> println(data.flatMap(_.toUpperCase))

List(H, A, D, O, O, P, J, A, V, A, S, P, A, R, K)
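The answer above works because Scala treats a String as a sequence of Char, so flatMap concatenates the characters of every upper-cased string into one list. A small sketch contrasting map and flatMap on the same list makes this visible. The names `upperWords` and `upperLetters` are illustrative.

```scala
val data = List("Hadoop", "Java", "Spark")

// map keeps one element per input string: the result is a List[String].
val upperWords = data.map(_.toUpperCase)

// flatMap views each upper-cased string as a sequence of Char and
// concatenates those sequences: the result is a List[Char].
val upperLetters = data.flatMap(_.toUpperCase)

println(upperWords)
println(upperLetters.mkString)
```

Here `upperWords` has 3 elements while `upperLetters` has 15, one per letter.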

34. Log in to the master node of the big data cloud host and create a file abc.txt in the root directory with the following content:

hadoop  hive

solr    redis

kafka   hadoop

storm   flume

sqoop   docker

spark   spark

hadoop  spark

elasticsearch   hbase

hadoop  hive

spark   hive

hadoop  spark

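Then log in to spark-shell. First count the number of lines in abc.txt, then count the occurrences of each word in abc.txt and sort the result in ascending alphabetical order of the words (by first letter), and finally count the number of rows in the result.

Answer: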

scala> val words=sc.textFile("file:///root/abc.txt").count

words: Long = 11

scala> val words=sc.textFile("file:///root/abc.txt").flatMap(_.split("\\W+")).map(x=>(x,1)).reduceByKey(_+_).sortByKey().collect

words: Array[(String, Int)] = Array((docker,1), (elasticsearch,1), (flume,1), (hadoop,5), (hbase,1), (hive,3), (kafka,1), (redis,1), (solr,1), (spark,5), (sqoop,1), (storm,1))

scala> val words=sc.textFile("file:///root/abc.txt").flatMap(_.split("\\W+")).map(x=>(x,1)).reduceByKey(_+_).count

words: Long = 12
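The counts above can be cross-checked without Spark. The sketch below is an assumption-labelled stand-in, not the Spark answer: it copies the content of abc.txt into a plain Scala list named `lines` and repeats the same split, count, and sort steps with standard collections, where groupBy takes the place of reduceByKey.

```scala
// Content of abc.txt as given in the exercise, one string per line.
val lines = List(
  "hadoop  hive",
  "solr    redis",
  "kafka   hadoop",
  "storm   flume",
  "sqoop   docker",
  "spark   spark",
  "hadoop  spark",
  "elasticsearch   hbase",
  "hadoop  hive",
  "spark   hive",
  "hadoop  spark"
)

// Split on runs of non-word characters, group equal words, count each group,
// then sort by the word itself (ascending), like sortByKey.
val counts = lines
  .flatMap(_.split("\\W+"))
  .filter(_.nonEmpty)
  .groupBy(identity)
  .map { case (word, occurrences) => (word, occurrences.size) }
  .toList
  .sortBy(_._1)

println(lines.size)   // number of lines
println(counts)       // sorted (word, count) pairs
println(counts.size)  // number of distinct words
```

With this input the sketch reports 11 lines and 12 distinct words, in agreement with the spark-shell output above.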

35. Log in to spark-shell, define List(1,2,3,3,4,4,5,5,6,6,6,8,9), and use Spark's built-in functions to remove the duplicate elements from the list.

Answer:

scala> val l = List(1,2,3,3,4,4,5,5,6,6,6,8,9)

l: List[Int] = List(1, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 8, 9)

scala> l.distinct

res1: List[Int] = List(1, 2, 3, 4, 5, 6, 8, 9)
