The CDH QuickStart VM contains the complete single-node Hadoop service ecosystem and can be downloaded from https://www.cloudera.com/downloads/quickstart_vms/5-13.html. As shown below:

The corresponding nodes are as follows (Cloudera Navigator is not included):

To study the full Hadoop ecosystem, a server with at least 8 cores/32 GB is best; 4 cores/16 GB can just barely run it (ideally use two or more nodes).

Impala is written in C++ (Spark is written in Scala) and uses an MPP architecture (similar to MariaDB ColumnStore, formerly InfiniDB). It consists of the following components:

Hue is a web-based smart query analyzer with syntax hints that can query Impala, HDFS, and HBase. As shown below:

The Impala server consists of the Impala Daemon (executes SQL), the Impala Statestore (monitors daemon health), and the Impala Catalog (pushes DDL changes out to the daemon nodes, which removes the need to run REFRESH/INVALIDATE METADATA when DDL is executed through Impala; when DDL goes through Hive, it is still required). impala-shell is similar to the mysql client and is used to execute SQL.
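When tables are created through Hive or data files are added outside Impala, the metadata still has to be refreshed by hand. A minimal sketch of the relevant statements (the table name `web_logs` is a hypothetical placeholder):

```sql
-- After Hive-side DDL (e.g. CREATE TABLE issued in the Hive shell),
-- make the new table visible to the Impala daemons:
INVALIDATE METADATA web_logs;

-- After new data files land in an existing table's HDFS directory
-- outside of Impala (hdfs dfs -put, a Hive INSERT, etc.), a
-- lighter-weight refresh of that one table's file metadata suffices:
REFRESH web_logs;
```

INVALIDATE METADATA discards and reloads all metadata for the table, so REFRESH is preferred when only the data files changed.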

Impala uses the same metadata as Hive; it can be stored in MySQL or PostgreSQL and is called the metastore.

Impala uses HDFS as its primary underlying storage, taking advantage of HDFS redundancy.

Impala also supports HBase as storage: by defining tables mapped to HBase, it can query HBase tables and even join HBase tables with Impala tables.
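A sketch of such a mapping, assuming an existing HBase table `user_profiles` with a column family `cf` (all names here are hypothetical); the mapping table is created through Hive with the HBase storage handler, after which Impala can query and join it:

```sql
-- Executed in the Hive shell (Impala itself cannot create HBase-backed tables):
CREATE EXTERNAL TABLE hbase_users (
  id STRING,
  name STRING,
  age INT
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:name,cf:age")
TBLPROPERTIES ("hbase.table.name" = "user_profiles");

-- Then, in impala-shell, pick up the new table and join it with an HDFS table:
INVALIDATE METADATA hbase_users;
SELECT t.*, h.name
FROM tab1 t JOIN hbase_users h ON CAST(t.id AS STRING) = h.id;
```

The `:key` entry in `hbase.columns.mapping` binds the first column to the HBase row key, which is what makes point lookups on `id` efficient.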

Impala can be started with Cloudera Manager or from the command line:

Starting with Cloudera Manager:

Starting from the command line (when started this way, CM cannot monitor the services' status, and the processes also differ slightly):

service impala-state-store start/restart/stop

service impala-catalog start/restart/stop

service impala-server start/restart/stop

The processes after starting via CM:

Log files are located in /var/log/impala, as follows:

Configuration can be changed through CM, or by editing the configuration file /etc/default/impala.
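For example, the daemon's memory limit can be raised by editing the startup flags. A sketch following the CDH package layout; the `-mem_limit` value here is purely illustrative:

```shell
# /etc/default/impala (CDH package layout) -- append flags to the daemon args
IMPALA_SERVER_ARGS=" \
    -log_dir=${IMPALA_LOG_DIR} \
    -state_store_host=${IMPALA_STATE_STORE_HOST} \
    -mem_limit=4g"
```

After editing, restart impala-server for the flags to take effect; on a CM-managed cluster, make the equivalent change in CM instead so it is not overwritten.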

The Impala client

[root@quickstart impala]# impala-shell
Starting Impala Shell without Kerberos authentication
Connected to quickstart.cloudera:
Server version: impalad version 2.10.0-cdh5.13.0 RELEASE (build 2511805f1eaa991df1460276c7e9f19d819cd4e4)
***********************************************************************************
Welcome to the Impala shell.
(Impala Shell v2.10.0-cdh5.13.0 () built on Wed Oct :: PDT )
The HISTORY command lists all shell commands in chronological order.
***********************************************************************************
[quickstart.cloudera:] >

A brand-new Impala instance contains two databases: default (the default database for newly created tables) and _impala_builtins.

Database information can be inspected with SHOW DATABASES / SHOW TABLES / SELECT version() (the syntax is SQL-92/MySQL compatible — the reference for most NoSQL implementations — and differs from Oracle).

[quickstart.cloudera:] > select version();
Query: select version()
Query submitted at: -- :: (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=ef4cbbf93a7662e5:f4e103f500000000
+-------------------------------------------------------------------------------------------+
| version() |
+-------------------------------------------------------------------------------------------+
| impalad version 2.10.0-cdh5.13.0 RELEASE (build 2511805f1eaa991df1460276c7e9f19d819cd4e4) |
| Built on Wed Oct :: PDT |
+-------------------------------------------------------------------------------------------+
Fetched row(s) in .34s
[quickstart.cloudera:] > show databases;
Query: show databases
+------------------+----------------------------------------------+
| name | comment |
+------------------+----------------------------------------------+
| _impala_builtins | System database for Impala builtin functions |
| default | Default Hive database |
+------------------+----------------------------------------------+
Fetched row(s) in .01s
[quickstart.cloudera:] > show tables;
Query: show tables
+------+
| name |
+------+
| tab1 |
| tab2 |
| tab3 |
+------+
Fetched row(s) in .02s
[quickstart.cloudera:] > select * from tab1;
Query: select * from tab1
Query submitted at: -- :: (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=dd4debde8a589622:81983b200000000
+----+-------+------------+-------------------------------+
| id | col_1 | col_2 | col_3 |
+----+-------+------------+-------------------------------+
| | true | 123.123 | -- :: |
| | false | 1243.5 | -- :: |
| | false | 24453.325 | -- ::21.123000000 |
| | false | 243423.325 | -- ::21.334540000 |
| | true | 243.325 | -- :: |
+----+-------+------------+-------------------------------+
Fetched row(s) in .06s -- the first access is noticeably slow because the data has to be loaded into memory
[quickstart.cloudera:] > select * from tab1;
Query: select * from tab1
Query submitted at: -- :: (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=21486d50db995bf2:a2d061b400000000
+----+-------+------------+-------------------------------+
| id | col_1 | col_2 | col_3 |
+----+-------+------------+-------------------------------+
| | true | 123.123 | -- :: |
| | false | 1243.5 | -- :: |
| | false | 24453.325 | -- ::21.123000000 |
| | false | 243423.325 | -- ::21.334540000 |
| | true | 243.325 | -- :: |
+----+-------+------------+-------------------------------+
Fetched row(s) in .26s
[quickstart.cloudera:] > desc tab1;
Query: describe tab1
+-------+-----------+---------+
| name | type | comment |
+-------+-----------+---------+
| id | int | |
| col_1 | boolean | |
| col_2 | double | |
| col_3 | timestamp | |
+-------+-----------+---------+
Fetched row(s) in .03s
[quickstart.cloudera:] > select count(*) from tab1;
Query: select count(*) from tab1
Query submitted at: -- :: (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=9141cf84d4efd2e3:e667f1d00000000
+----------+
| count(*) |
+----------+
| |
+----------+
Fetched row(s) in .17s
[quickstart.cloudera:] > create database my_first_impala_db;
Query: create database my_first_impala_db
Fetched row(s) in .08s
[quickstart.cloudera:] > create table t1 (x int);
Query: create table t1 (x int)
Fetched row(s) in .08s
[quickstart.cloudera:] > insert into t1 values (), (), (), (); -- MySQL-style multi-row VALUES syntax is supported
Query: insert into t1 values (), (), (), ()
Query submitted at: -- :: (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=8c41d97919423a19:1f7aeeeb00000000
Modified row(s) in .17s
[quickstart.cloudera:] > insert into t1 select * from t1; -- INSERT ... SELECT is supported
Query: insert into t1 select * from t1
Query submitted at: -- :: (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=9c452b951b08da49:92d9fee100000000
Modified row(s) in .23s

Create a table on top of HDFS files (the HDFS files must be created and loaded with data first; see the Hadoop-HDFS study notes):

[quickstart.cloudera:] > DROP TABLE IF EXISTS tab2;
Query: drop TABLE IF EXISTS tab2
[quickstart.cloudera:] > CREATE EXTERNAL TABLE tab2
> (
> id INT,
> col_1 BOOLEAN,
> col_2 DOUBLE
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> LOCATION '/user/cloudera/sample_data/tab2';
Query: create EXTERNAL TABLE tab2
(
id INT,
col_1 BOOLEAN,
col_2 DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/sample_data/tab2'
Fetched row(s) in .09s
[quickstart.cloudera:] > DROP TABLE IF EXISTS tab3;
Query: drop TABLE IF EXISTS tab3
[quickstart.cloudera:] > CREATE TABLE tab3
> (
> id INT,
> col_1 BOOLEAN,
> col_2 DOUBLE,
> month INT,
> day INT
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Query: create TABLE tab3
(
id INT,
col_1 BOOLEAN,
col_2 DOUBLE,
month INT,
day INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
Fetched row(s) in .09s
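Note that tab3 keeps month and day as ordinary columns; to actually benefit from partition pruning they would be declared as partition keys instead. A sketch of the partitioned variant (the name `tab3_part` is hypothetical):

```sql
CREATE TABLE tab3_part
(
  id INT,
  col_1 BOOLEAN,
  col_2 DOUBLE
)
PARTITIONED BY (month INT, day INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Dynamic-partition insert: the trailing SELECT columns fill the partition keys.
INSERT INTO tab3_part PARTITION (month, day)
SELECT id, col_1, col_2, month, day FROM tab3;
```

Queries filtering on `month`/`day` then scan only the matching HDFS subdirectories instead of the whole table.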

A SQL query with a join, a subquery, and aggregation:

SELECT tab2.*
FROM tab2,
(SELECT tab1.col_1, MAX(tab2.col_2) AS max_col2
FROM tab2, tab1
WHERE tab1.id = tab2.id
GROUP BY col_1) subquery1
WHERE subquery1.max_col2 = tab2.col_2;

View the SQL execution plan:

explain SELECT tab2.*
FROM tab2,
(SELECT tab1.col_1, MAX(tab2.col_2) AS max_col2
FROM tab2, tab1
WHERE tab1.id = tab2.id
GROUP BY col_1) subquery1
WHERE subquery1.max_col2 = tab2.col_2;
+------------------------------------------------------------------------------------+
| Explain String |
+------------------------------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=.00MB |
| Per-Host Resource Estimates: Memory=.34GB |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| default.tab1, default.tab2 |
| |
| PLAN-ROOT SINK |
| | |
| :EXCHANGE [UNPARTITIONED] |
| | |
| :HASH JOIN [INNER JOIN, BROADCAST] |
| | hash predicates: tab2.col_2 = max(tab2.col_2) |
| | runtime filters: RF000 <- max(tab2.col_2) |
| | |
| |--:EXCHANGE [BROADCAST] |
| | | |
| | :AGGREGATE [FINALIZE] |
| | | output: max:merge(tab2.col_2) |
| | | group by: tab1.col_1 |
| | | |
| | :EXCHANGE [HASH(tab1.col_1)] |
| | | |
| | :AGGREGATE [STREAMING] |
| | | output: max(tab2.col_2) |
| | | group by: tab1.col_1 |
| | | |
| | :HASH JOIN [INNER JOIN, BROADCAST] |
| | | hash predicates: tab2.id = tab1.id |
| | | runtime filters: RF001 <- tab1.id |
| | | |
| | |--:EXCHANGE [BROADCAST] |
| | | | |
| | | :SCAN HDFS [default.tab1] |
| | | partitions=/ files= size=192B |
| | | |
| | :SCAN HDFS [default.tab2] |
| | partitions=/ files= size=158B |
| | runtime filters: RF001 -> tab2.id |
| | |
| :SCAN HDFS [default.tab2] |
| partitions=/ files= size=158B |
| runtime filters: RF000 -> tab2.col_2 |
+------------------------------------------------------------------------------------+
Fetched row(s) in .05s
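The WARNING at the top of the plan is worth acting on: without statistics the planner has to guess at cardinalities when choosing join order and join strategy. A minimal sketch of collecting them:

```sql
COMPUTE STATS tab1;
COMPUTE STATS tab2;

-- Re-running EXPLAIN afterwards should no longer show the warning, and
-- SHOW TABLE STATS / SHOW COLUMN STATS expose what was collected:
SHOW TABLE STATS tab2;
SHOW COLUMN STATS tab2;
```

COMPUTE STATS scans the table, so on large tables it is usually scheduled after bulk loads rather than run ad hoc.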

Create a table from Parquet files, then convert it into an internal partitioned table

[quickstart.cloudera:] > USE airlines_data;
Query: use airlines_data
[quickstart.cloudera:] > CREATE EXTERNAL TABLE airlines_external
> LIKE PARQUET
> 'hdfs:/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq'
> STORED AS PARQUET LOCATION 'hdfs:/user/impala/staging/airlines';
Query: create EXTERNAL TABLE airlines_external
LIKE PARQUET
'hdfs:/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq'
STORED AS PARQUET LOCATION 'hdfs:/user/impala/staging/airlines'
WARNINGS: Impala does not have READ_WRITE access to path 'hdfs://quickstart.cloudera:8020/user/impala/staging' Fetched row(s) in .82s
[quickstart.cloudera:] > SHOW TABLE STATS airlines_external;
Query: show TABLE STATS airlines_external
+-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------+
| #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
+-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------+
| - | | .34GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/impala/staging/airlines |
+-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------+
Fetched row(s) in .89s
[quickstart.cloudera:] > SHOW FILES IN airlines_external;
Query: show FILES IN airlines_external
+-----------------------------------------------------------------------------------------------------------------------+----------+-----------+
| Path | Size | Partition |
+-----------------------------------------------------------------------------------------------------------------------+----------+-----------+
| hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq | 252.99MB | |
| hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.1.parq | 13.43MB | |
| hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd984_501176748_data.0.parq | 252.84MB | |
| hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd984_501176748_data.1.parq | 63.92MB | |
| hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd985_1199995767_data.0.parq | 183.64MB | |
| hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd986_2086627597_data.0.parq | 240.04MB | |
| hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd987_1048668565_data.0.parq | 211.35MB | |
| hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd988_1432111844_data.0.parq | 151.46MB | |
+-----------------------------------------------------------------------------------------------------------------------+----------+-----------+
Fetched row(s) in .04s
[quickstart.cloudera:] > DESCRIBE airlines_external;
Query: describe airlines_external
+---------------------+--------+-----------------------------+
| name | type | comment |
+---------------------+--------+-----------------------------+
| year | int | Inferred from Parquet file. |
| month | int | Inferred from Parquet file. |
| day | int | Inferred from Parquet file. |
| dayofweek | int | Inferred from Parquet file. |
| dep_time | int | Inferred from Parquet file. |
| crs_dep_time | int | Inferred from Parquet file. |

-- single-table query performance is quite good...

[quickstart.cloudera:21000] > SELECT COUNT(*) FROM airlines_external;
Query: select COUNT(*) FROM airlines_external
Query submitted at: 2019-04-06 07:08:33 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=a04cd594518cba3a:1666b4e00000000
+-----------+
| count(*) |
+-----------+
| 123534969 |
+-----------+
Fetched 1 row(s) in 0.33s
[quickstart.cloudera:21000] > SELECT NDV(carrier), NDV(flight_num), NDV(tail_num),
> NDV(origin), NDV(dest) FROM airlines_external;
Query: select NDV(carrier), NDV(flight_num), NDV(tail_num),
NDV(origin), NDV(dest) FROM airlines_external
Query submitted at: 2019-04-06 07:08:53 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=df4776d3fa8c1f69:e4e09e6100000000
+--------------+-----------------+---------------+-------------+-----------+
| ndv(carrier) | ndv(flight_num) | ndv(tail_num) | ndv(origin) | ndv(dest) |
+--------------+-----------------+---------------+-------------+-----------+
| 29 | 8463 | 3 | 342 | 349 |
+--------------+-----------------+---------------+-------------+-----------+
Fetched 1 row(s) in 9.33s
[quickstart.cloudera:21000] > SELECT tail_num, COUNT(*) AS howmany FROM airlines_external
> GROUP BY tail_num;
Query: select tail_num, COUNT(*) AS howmany FROM airlines_external
GROUP BY tail_num
Query submitted at: 2019-04-06 07:09:19 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=7d43af89d18c600e:bc464f0f00000000
+----------+-----------+
| tail_num | howmany |
+----------+-----------+
| 715 | 1 |
| 0 | 406405 |
| 112 | 6562 |
| NULL | 123122001 |
+----------+-----------+
Fetched 4 row(s) in 2.42s

-- joining large tables: out-of-memory error
[quickstart.cloudera:] > SELECT DISTINCT dest FROM airlines_external
> WHERE dest NOT IN (SELECT origin FROM airlines_external);
Query: select DISTINCT dest FROM airlines_external
WHERE dest NOT IN (SELECT origin FROM airlines_external)
Query submitted at: -- :: (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=314343c761a55f97:61ce9aa500000000
WARNINGS: Memory limit exceeded: Error occurred on backend quickstart.cloudera: by fragment 314343c761a55f97:61ce9aa500000002
Memory left in process limit: -328.00 KB
Query(314343c761a55f97:61ce9aa500000000): Reservation=408.00 MB ReservationLimit=409.60 MB OtherMemory=14.76 MB Total=422.76 MB Peak=423.60 MB
Unclaimed reservations: Reservation=34.00 MB OtherMemory= Total=34.00 MB Peak=108.00 MB
Fragment 314343c761a55f97:61ce9aa500000000: Reservation= OtherMemory=8.00 KB Total=8.00 KB Peak=8.00 KB
EXCHANGE_NODE (id=): Total= Peak=
DataStreamRecvr: Total= Peak=
PLAN_ROOT_SINK: Total= Peak=
CodeGen: Total= Peak=
Fragment 314343c761a55f97:61ce9aa500000003: Reservation= OtherMemory=38.51 KB Total=38.51 KB Peak=383.65 KB
AGGREGATION_NODE (id=): Total=21.12 KB Peak=21.12 KB
Exprs: Total=21.12 KB Peak=21.12 KB
EXCHANGE_NODE (id=): Total= Peak=
DataStreamRecvr: Total= Peak=
DataStreamSender (dst_id=): Total=7.52 KB Peak=7.52 KB
CodeGen: Total=1.86 KB Peak=347.00 KB
Fragment 314343c761a55f97:61ce9aa500000002: Reservation=374.00 MB OtherMemory=14.72 MB Total=388.72 MB Peak=388.72 MB
AGGREGATION_NODE (id=): Reservation=34.00 MB OtherMemory=5.66 MB Total=39.66 MB Peak=39.66 MB
Exprs: Total=21.12 KB Peak=21.12 KB
HASH_JOIN_NODE (id=): Reservation=340.00 MB OtherMemory=58.25 KB Total=340.06 MB Peak=340.09 MB
Exprs: Total=21.12 KB Peak=21.12 KB
Hash Join Builder (join_node_id=): Total=21.12 KB Peak=29.12 KB
Hash Join Builder (join_node_id=) Exprs: Total=21.12 KB Peak=21.12 KB
HDFS_SCAN_NODE (id=): Total=8.98 MB Peak=9.27 MB
EXCHANGE_NODE (id=): Total= Peak=
DataStreamRecvr: Total= Peak=11.65 MB
DataStreamSender (dst_id=): Total=7.52 KB Peak=7.52 KB
CodeGen: Total=12.80 KB Peak=2.00 MB
Fragment 314343c761a55f97:61ce9aa500000001: Reservation= OtherMemory= Total= Peak=9.34 MB
HDFS_SCAN_NODE (id=): Total= Peak=9.32 MB
DataStreamSender (dst_id=): Total= Peak=7.52 KB
CodeGen: Total= Peak=49.00 KBProcess: memory limit exceeded. Limit=512.00 MB Total=512.32 MB Peak=512.32 MB
Buffer Pool: Free Buffers: Total=260.00 MB
Buffer Pool: Clean Pages: Total=40.00 MB
Buffer Pool: Unused Reservation: Total=-300.00 MB
RequestPool=fe-eval-exprs: Total= Peak=4.00 KB
RequestPool=root.root: Total= Peak=139.93 MB
RequestPool=root.cloudera: Total=184.00 B Peak=431.27 KB
Query(a34bd5934157257d:2e53c5ce00000000): Reservation= ReservationLimit=409.60 MB OtherMemory=184.00 B Total=184.00 B Peak=431.27 KB
RequestPool=root.hdfs: Total=422.76 MB Peak=423.60 MB
Query(314343c761a55f97:61ce9aa500000000): Reservation=408.00 MB ReservationLimit=409.60 MB OtherMemory=14.76 MB Total=422.76 MB Peak=423.60 MB
Untracked Memory: Total=89.56 MB WARNING: The following tables are missing relevant table and/or column statistics.
airlines_data.airlines_external
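On a memory-starved single node like the QuickStart VM, the per-query memory limit can also be adjusted per session. A sketch (the 1g value is illustrative; raising it only helps if the impalad process limit itself is high enough):

```sql
-- In impala-shell: cap (or raise) the memory this session's queries may reserve.
SET MEM_LIMIT=1g;

-- List the effective query options for the session:
SET;
```

Lowering MEM_LIMIT can also make a query fail fast with a clear error instead of pushing the whole impalad process over its limit.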

The impala-server web UI on port 25000 shows statement execution progress:

CREATE TABLE airlines_data.airlines
(month INT,
day INT,
dayofweek INT,
dep_time INT,
crs_dep_time INT,
arr_time INT,
crs_arr_time INT,
carrier STRING,
flight_num INT,
actual_elapsed_time INT,
crs_elapsed_time INT,
airtime INT,
arrdelay INT,
depdelay INT,
origin STRING,
dest STRING,
distance INT,
taxi_in INT,
taxi_out INT,
cancelled INT,
cancellation_code STRING,
diverted INT,
carrier_delay INT,
weather_delay INT,
nas_delay INT,
security_delay INT,
late_aircraft_delay INT)
PARTITIONED BY (year INT)
STORED AS PARQUET;

INSERT INTO airlines_data.airlines
PARTITION (year)
SELECT
month,
day,
dayofweek,
dep_time,
crs_dep_time,
arr_time,
crs_arr_time,
carrier,
flight_num,
actual_elapsed_time,
crs_elapsed_time,
airtime,
arrdelay,
depdelay,
origin,
dest,
distance,
taxi_in,
taxi_out,
cancelled,
cancellation_code,
diverted,
carrier_delay,
weather_delay,
nas_delay,
security_delay,
late_aircraft_delay,
year
FROM airlines_data.airlines_external limit ; -- runs out of memory at 100 million rows
[quickstart.cloudera:] > SHOW TABLE STATS airlines;
Query: show TABLE STATS airlines
+-------+---------+--------+----------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------------------------------+
| year | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
+-------+---------+--------+----------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------------------------------+
| | | | .62MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://quickstart.cloudera:8020/user/hive/warehouse/airlines_data.db/airlines/year=1990 |
| | | | .13KB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://quickstart.cloudera:8020/user/hive/warehouse/airlines_data.db/airlines/year=2002 |
| | | | .76KB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://quickstart.cloudera:8020/user/hive/warehouse/airlines_data.db/airlines/year=2003 |
| Total | | | .73MB | 0B | | | | |
+-------+---------+--------+----------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------------------------------+
Fetched row(s) in .14s
[quickstart.cloudera:] > COMPUTE INCREMENTAL STATS airlines;
Query: compute INCREMENTAL STATS airlines
WARNINGS: No partitions selected for incremental stats update Fetched row(s) in .01s
[quickstart.cloudera:] > COMPUTE STATS airlines;
Query: compute STATS airlines
+------------------------------------------+
| summary |
+------------------------------------------+
| Updated partition(s) and column(s). |
+------------------------------------------+
Fetched row(s) in .02s
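COMPUTE INCREMENTAL STATS can also be pointed at a single partition, which is much cheaper than recomputing the whole table after loading one new partition (the year value here is illustrative):

```sql
COMPUTE INCREMENTAL STATS airlines PARTITION (year=2003);

-- If a partition's incremental stats go stale, they can be dropped individually:
DROP INCREMENTAL STATS airlines PARTITION (year=2003);
```

Incremental stats are kept per partition in the catalog, so on tables with very many partitions they increase metadata memory usage.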

TPC-DS: https://github.com/cloudera/impala-tpcds-kit/tree/master/tpcds-gen (generates TPC-DS test data sets and includes the TPC-DS benchmark itself; it can generate data at the 10 TB level)

The customer data in the official tpcds-kit no longer works (mainly because the link http://www.tpc.org/tpcds/dsgen/dsgen-download-files.asp is dead); a working script is at https://github.com/sleberknight/impalascripts-0.6/blob/master/tpcds-setup.sh.

Common problems:

When creating files in Hadoop, the error "Cannot create directory /user/hadoop/input. Name node is in safe mode." appears.
Cause and solution: https://www.waitig.com/hadoop-name-node-is-in-safe-mode.html

[cloudera@quickstart ~]$ hdfs dfs -put tab1.csv /user/cloudera/sample_data/tab1
put: Permission denied: user=cloudera, access=WRITE, inode="/user/cloudera/sample_data/tab1":hdfs:cloudera:drwxr-xr-x
Switching to the hdfs user fixes it, as follows:
-bash-4.1$ hdfs dfs -put tab1.csv /user/cloudera/sample_data/tab1
-bash-4.1$ hdfs dfs -put tab2.csv /user/cloudera/sample_data/tab2
-bash-4.1$ hdfs dfs -ls /user/cloudera/sample_data/tab1
Found 1 items
-rw-r--r-- 1 hdfs cloudera 192 2019-04-05 23:06 /user/cloudera/sample_data/tab1/tab1.csv

WARNINGS: Impala does not have READ_WRITE access to path 'hdfs://quickstart.cloudera:8020/user/cloudera/sample_data'
Error analysis:
impala-shell runs as the impala user, and impala has no read/write permission on the HDFS path.

How to fix it:
Option 1: grant permissions on the HDFS directory: hadoop fs -chmod -R 777 path
-bash-4.1$ hadoop fs -chmod -R 777 /user/cloudera/sample_data
-bash-4.1$ exit
logout

Option 2: create a hadoop user group, add impala to it, and grant the impala user the required permissions

[quickstart.cloudera:21000] > CREATE EXTERNAL TABLE airlines_external
> LIKE PARQUET
> 'hdfs:staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq'
> STORED AS PARQUET LOCATION 'hdfs:staging/airlines';
Query: create EXTERNAL TABLE airlines_external
LIKE PARQUET
'hdfs:staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq'
STORED AS PARQUET LOCATION 'hdfs:staging/airlines'
ERROR: AnalysisException: null
CAUSED BY: IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: hdfs:staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq
CAUSED BY: URISyntaxException: Relative path in absolute URI: hdfs:staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq

Solution:

The path is invalid; use a correct absolute path such as hdfs:/user/impala/staging/airlines/XXX
