Hello World on Impala

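Cloudera's official guide, the Impala Tutorial, covers some basic Impala operations, but its steps do not follow one another coherently. This article picks some of the examples from the Impala Tutorial and walks through one complete example from scratch: creating tables, loading data, and querying data. It is meant as an entry-level tutorial; by following the steps here, you say "Hello World" to Impala.

This article assumes that you already have a working Impala installation. For setting up the environment, see: Installing Hive, HBase, Impala, Spark and other services on CDH5

Create the cloudera user and group

The examples in the Impala Tutorial log in as the user cloudera, but Cloudera Manager 5.0.2 does not automatically create a cloudera user on the host nodes (for example, h1.worker.com) during installation. To stay consistent with the examples in the Impala Tutorial, create the cloudera user and group by hand.

Log in to the host node (for example, h1.worker.com) as root and first check whether a cloudera user exists by running the following command: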

[root@h1 home]# cat /etc/passwd | grep cloudera

cloudera-scm:x:496:493:Cloudera Manager:/var/run/cloudera-scm-server:/sbin/nologin

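The output above shows that no cloudera user exists; the only match is the cloudera-scm service account. If the user already exists, skip the user creation steps below.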
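Because a plain grep for "cloudera" also matches cloudera-scm, an exact match on the user name field is less ambiguous. The following is a minimal sketch, not part of the original tutorial; the function name user_exists and its optional file argument are hypothetical helpers introduced here for illustration.

```shell
#!/usr/bin/env bash
# Return 0 if a passwd-format file has an account whose name is exactly $1.
# The optional second argument (file path) is an assumption added so the
# check can be tried on a sample file; it defaults to /etc/passwd.
user_exists() {
    local name="$1" file="${2:-/etc/passwd}"
    # The user name is the first colon-separated field of each line.
    awk -F: -v u="$name" '$1 == u { found = 1 } END { exit !found }' "$file"
}

if user_exists cloudera; then
    echo "cloudera user already exists"
else
    echo "cloudera user not found"
fi
```

On a host like the one shown above, where only cloudera-scm is present, the check reports that the user is not found.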

Create the cloudera user and group, and set the password to cloudera:

[root@h1 home]# groupadd cloudera

[root@h1 home]# useradd -g cloudera cloudera

[root@h1 home]# passwd cloudera

Changing password for user cloudera.

New password:

BAD PASSWORD: it is based on a dictionary word

Retype new password:

passwd: all authentication tokens updated successfully.

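Create the /user/cloudera directory on HDFS

We need to create the /user/cloudera directory on HDFS and change its owner to cloudera. Only the HDFS superuser has permission to do this. The HDFS superuser is the user that runs the NameNode process; loosely speaking, whoever starts the NameNode is the superuser. In an environment installed through Cloudera Manager 5, the superuser name is hdfs.

Switch to the HDFS superuser and first check whether the /user/cloudera directory exists; create it if it does not.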

[root@h1 home]# su - hdfs

-bash-4.1$ hdfs dfs -ls /user

Found 7 items

drwx------   - hdfs   supergroup          0 2014-06-26 08:44 /user/hdfs

drwxrwxrwx   - mapred hadoop              0 2014-06-20 10:10 /user/history

drwxrwxr-t   - hive   hive                0 2014-06-20 10:13 /user/hive

drwxrwxr-x   - impala impala              0 2014-06-20 10:18 /user/impala

drwxrwxr-x   - oozie  oozie               0 2014-06-20 10:15 /user/oozie

drwxr-x--x   - spark  spark               0 2014-06-20 10:08 /user/spark

drwxrwxr-x   - sqoop2 sqoop               0 2014-06-20 10:16 /user/sqoop2

Create the /user/cloudera directory on HDFS and set its owner and group to cloudera:

-bash-4.1$ hdfs dfs -mkdir -p /user/cloudera

-bash-4.1$ hdfs dfs -chown cloudera:cloudera /user/cloudera

-bash-4.1$ hdfs dfs -ls /user

Found 8 items

drwxr-xr-x   - cloudera cloudera            0 2014-06-26 09:05 /user/cloudera

drwx------   - hdfs     supergroup          0 2014-06-26 08:44 /user/hdfs

drwxrwxrwx   - mapred   hadoop              0 2014-06-20 10:10 /user/history

drwxrwxr-t   - hive     hive                0 2014-06-20 10:13 /user/hive

drwxrwxr-x   - impala   impala              0 2014-06-20 10:18 /user/impala

drwxrwxr-x   - oozie    oozie               0 2014-06-20 10:15 /user/oozie

drwxr-x--x   - spark    spark               0 2014-06-20 10:08 /user/spark

drwxrwxr-x   - sqoop2   sqoop               0 2014-06-20 10:16 /user/sqoop2

After these steps, everything needed to run the examples in the Impala Tutorial is in place.
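The check-then-create sequence above can be wrapped so that it is safe to run more than once. This is a rough sketch under stated assumptions: the function name ensure_hdfs_home and the HDFS override variable are hypothetical and exist only for illustration and dry runs; the hdfs dfs -test -d, -mkdir -p, and -chown subcommands are standard HDFS shell commands.

```shell
#!/usr/bin/env bash
# Command used to talk to HDFS. Overriding it is a hypothetical hook for
# dry runs; by default the real hdfs client is used.
HDFS="${HDFS:-hdfs}"

# Create /user/<name> on HDFS and hand it to <name>:<name>, unless the
# directory is already there. Must run as the HDFS superuser (hdfs).
ensure_hdfs_home() {
    local user="$1" dir="/user/$1"
    if "$HDFS" dfs -test -d "$dir"; then
        echo "$dir already exists"
    else
        "$HDFS" dfs -mkdir -p "$dir" && "$HDFS" dfs -chown "$user:$user" "$dir"
    fi
}

# Usage (after "su - hdfs"):
#   ensure_hdfs_home cloudera
```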

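Create HDFS directories for the table data

This section shows how to create a few very small tables, suitable for first-time users experimenting with Impala SQL features. TAB1 and TAB2 load their data from files in HDFS. You can put the data you want to query into HDFS. To begin, create one or more subdirectories under your HDFS user directory. The data for each table is stored in its own subdirectory. This example uses the -p option of mkdir, so that any parent directories that do not yet exist are created automatically.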

[root@h1 ~]# su - cloudera

[cloudera@h1 ~]$ whoami

cloudera

[cloudera@h1 ~]$ hdfs dfs -ls /user

Found 8 items

drwxr-xr-x   - cloudera cloudera            0 2014-06-26 09:05 /user/cloudera

drwx------   - hdfs     supergroup          0 2014-06-26 08:44 /user/hdfs

drwxrwxrwx   - mapred   hadoop              0 2014-06-20 10:10 /user/history

drwxrwxr-t   - hive     hive                0 2014-06-20 10:13 /user/hive

drwxrwxr-x   - impala   impala              0 2014-06-20 10:18 /user/impala

drwxrwxr-x   - oozie    oozie               0 2014-06-20 10:15 /user/oozie

drwxr-x--x   - spark    spark               0 2014-06-20 10:08 /user/spark

drwxrwxr-x   - sqoop2   sqoop               0 2014-06-20 10:16 /user/sqoop2

[cloudera@h1 ~]$ hdfs dfs -mkdir -p /user/cloudera/sample_data/tab1 /user/cloudera/sample_data/tab2

[cloudera@h1 ~]$

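These steps create the directories that hold the data for the TAB1 and TAB2 tables.

Put the CSV files into the HDFS directories

Copy the following two .csv files to the local file system.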

tab1.csv:

1,true,123.123,2012-10-24 08:55:00

2,false,1243.5,2012-10-25 13:40:00

3,false,24453.325,2008-08-22 09:33:21.123

4,false,243423.325,2007-05-12 22:32:21.33454

5,true,243.325,1953-04-22 09:11:33

tab2.csv:

1,true,12789.123

2,false,1243.5

3,false,24453.325

4,false,2423.3254

5,true,243.325

60,false,243565423.325

70,true,243.325

80,false,243423.325

90,true,243.325
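One way to get the two files onto the local file system is to write them with shell here-documents. This is only a convenience sketch; the directory name testdata matches the session shown in this article, and the rows are copied from the listings above.

```shell
#!/usr/bin/env bash
# Write the two sample CSV files into a local working directory.
dir=testdata
mkdir -p "$dir"

# tab1.csv: id, boolean, double, timestamp
cat > "$dir/tab1.csv" <<'EOF'
1,true,123.123,2012-10-24 08:55:00
2,false,1243.5,2012-10-25 13:40:00
3,false,24453.325,2008-08-22 09:33:21.123
4,false,243423.325,2007-05-12 22:32:21.33454
5,true,243.325,1953-04-22 09:11:33
EOF

# tab2.csv: id, boolean, double
cat > "$dir/tab2.csv" <<'EOF'
1,true,12789.123
2,false,1243.5
3,false,24453.325
4,false,2423.3254
5,true,243.325
60,false,243565423.325
70,true,243.325
80,false,243423.325
90,true,243.325
EOF

# Quick sanity check: print the number of rows in each file.
wc -l "$dir/tab1.csv" "$dir/tab2.csv"
```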

Run the following commands to put each .csv file into its own HDFS directory:

[cloudera@h1 testdata]$ pwd

/home/cloudera/testdata

[cloudera@h1 testdata]$ ll

total 8

-rw-rw-r--. 1 cloudera cloudera 193 Jun 27 08:33 tab1.csv

-rw-rw-r--. 1 cloudera cloudera 158 Jun 27 08:34 tab2.csv

[cloudera@h1 testdata]$ hdfs dfs -put tab1.csv /user/cloudera/sample_data/tab1

[cloudera@h1 testdata]$ hdfs dfs -ls /user/cloudera/sample_data/tab1

Found 1 items

-rw-r--r--   3 cloudera cloudera        193 2014-06-27 08:35 /user/cloudera/sample_data/tab1/tab1.csv

[cloudera@h1 testdata]$ hdfs dfs -put tab2.csv /user/cloudera/sample_data/tab2

[cloudera@h1 testdata]$ hdfs dfs -ls /user/cloudera/sample_data/tab2

Found 1 items

-rw-r--r--   3 cloudera cloudera        158 2014-06-27 08:36 /user/cloudera/sample_data/tab2/tab2.csv

[cloudera@h1 testdata]$

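The names of the data files do not matter. In fact, when Impala examines the contents of a data directory for the first time, it treats every file in the directory as a data file of the table, regardless of how many files there are or what they are named.

To find out which directories are available in your HDFS file system and what permissions the directories and files have, run hdfs dfs -ls / and keep running -ls down the directory tree that you see.

Create the tables and load the data

Use the impala-shell command to create the tables, either interactively or through a SQL script.

The following example creates three tables. The columns of each table use different data types, such as BOOLEAN or INT. The example also includes clauses that describe how the data is formatted, for instance that columns are separated by commas, so that data can be read from .csv files. Because the .csv files containing the data are already in the HDFS directory tree, we give the external tables TAB1 and TAB2 the location of the directory that contains the corresponding .csv file. Impala treats all data in all files under these directories as table data.

The table_setup.sql file contains the following: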

DROP TABLE IF EXISTS tab1;

-- The EXTERNAL clause means the data is located outside the central location for Impala data files

-- and is preserved when the associated Impala table is dropped. We expect the data to already

-- exist in the directory specified by the LOCATION clause.

CREATE EXTERNAL TABLE tab1

(

   id INT,

   col_1 BOOLEAN,

   col_2 DOUBLE,

   col_3 TIMESTAMP

)

ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

LOCATION '/user/cloudera/sample_data/tab1';

DROP TABLE IF EXISTS tab2;

-- TAB2 is an external table, similar to TAB1.

CREATE EXTERNAL TABLE tab2

(

   id INT,

   col_1 BOOLEAN,

   col_2 DOUBLE

)

ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

LOCATION '/user/cloudera/sample_data/tab2';

DROP TABLE IF EXISTS tab3;

-- Leaving out the EXTERNAL clause means the data will be managed

-- in the central Impala data directory tree. Rather than reading

-- existing data files when the table is created, we load the

-- data after creating the table.

CREATE TABLE tab3

(

   id INT,

   col_1 BOOLEAN,

   col_2 DOUBLE,

   month INT,

   day INT

)

ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Run the table_setup.sql script with:

impala-shell -i 172.16.230.152 -f table_setup.sql

The session looks like this:

[cloudera@h1 testdata]$ pwd

/home/cloudera/testdata

[cloudera@h1 testdata]$ ll

total 12

-rw-rw-r--. 1 cloudera cloudera  193 Jun 27 08:33 tab1.csv

-rw-rw-r--. 1 cloudera cloudera  158 Jun 27 08:34 tab2.csv

-rw-rw-r--. 1 cloudera cloudera 1106 Jun 27 08:49 table_setup.sql

[cloudera@h1 testdata]$ impala-shell -i 172.16.230.152 -f table_setup.sql

Starting Impala Shell without Kerberos authentication

Connected to 172.16.230.152:21000

Server version: impalad version 1.3.1-cdh5 RELEASE (build )

...

...

Returned 0 row(s) in 0.28s

[cloudera@h1 testdata]$
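The script above creates TAB3 empty, and its comment says the data is loaded after the table is created. The sketch below shows one way that load could be done, modeled on the INSERT ... SELECT pattern from the Impala Tutorial; the file name load_tab3.sql is a hypothetical choice, and the statement has not been run as part of this article's session.

```shell
#!/usr/bin/env bash
# Write a small SQL script that fills tab3 from tab1.
# The file name load_tab3.sql is an assumption made for this sketch.
cat > load_tab3.sql <<'EOF'
-- Copy the 2012 rows of tab1 into tab3, splitting the timestamp
-- into separate month and day columns.
INSERT OVERWRITE TABLE tab3
SELECT id, col_1, col_2, MONTH(col_3), DAYOFMONTH(col_3)
FROM tab1
WHERE YEAR(col_3) = 2012;
EOF

# Run it the same way as table_setup.sql (needs a reachable impalad):
#   impala-shell -i 172.16.230.152 -f load_tab3.sql
echo "wrote load_tab3.sql"
```

With the sample data, only ids 1 and 2 have timestamps in 2012, so TAB3 would receive two rows.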

Inspect the Impala table structure

Log in to impala-shell and run the following commands:

show tables;

describe tab1;

The session looks like this:

[cloudera@h1 testdata]$ impala-shell -i 172.16.230.152

Starting Impala Shell without Kerberos authentication

Connected to 172.16.230.152:21000

Server version: impalad version 1.3.1-cdh5 RELEASE (build )

Welcome to the Impala shell. Press TAB twice to see a list of available commands.

Copyright (c) 2012 Cloudera, Inc. All rights reserved.

(Shell build version: Impala Shell v1.3.1-cdh5 () built on Mon Jun  9 09:30:26 PDT 2014)

[172.16.230.152:21000] > show tables;

Query: show tables

+------+

| name |

+------+

| tab1 |

| tab2 |

| tab3 |

+------+

Returned 3 row(s) in 0.01s

[172.16.230.152:21000] > describe tab1;

Query: describe tab1

+-------+-----------+---------+

| name  | type      | comment |

+-------+-----------+---------+

| id    | int       |         |

| col_1 | boolean   |         |

| col_2 | double    |         |

| col_3 | timestamp |         |

+-------+-----------+---------+

Returned 4 row(s) in 6.85s

[172.16.230.152:21000] > quit;

Goodbye

[cloudera@h1 testdata]$

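Query the Impala tables

Log in to impala-shell and run the following SQL statements. The third query joins tab1 and tab2 on id, finds the largest tab2.col_2 for each value of tab1.col_1, and returns the tab2 rows that hold those maxima: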

SELECT * FROM tab1;

SELECT * FROM tab2 LIMIT 5;

SELECT tab2.*

FROM tab2,

(SELECT tab1.col_1, MAX(tab2.col_2) AS max_col2

FROM tab2, tab1

WHERE tab1.id = tab2.id

GROUP BY col_1) subquery1

WHERE subquery1.max_col2 = tab2.col_2;

The session looks like this:

[cloudera@h1 testdata]$ impala-shell -i 172.16.230.152

Starting Impala Shell without Kerberos authentication

Connected to 172.16.230.152:21000

Server version: impalad version 1.3.1-cdh5 RELEASE (build )

Welcome to the Impala shell. Press TAB twice to see a list of available commands.

Copyright (c) 2012 Cloudera, Inc. All rights reserved.

(Shell build version: Impala Shell v1.3.1-cdh5 () built on Mon Jun  9 09:30:26 PDT 2014)

[172.16.230.152:21000] > SELECT * FROM tab1;

Query: select * FROM tab1

+----+-------+------------+-------------------------------+

| id | col_1 | col_2      | col_3                         |

+----+-------+------------+-------------------------------+

| 1  | true  | 123.123    | 2012-10-24 08:55:00           |

| 2  | false | 1243.5     | 2012-10-25 13:40:00           |

| 3  | false | 24453.325  | 2008-08-22 09:33:21.123000000 |

| 4  | false | 243423.325 | 2007-05-12 22:32:21.334540000 |

| 5  | true  | 243.325    | 1953-04-22 09:11:33           |

+----+-------+------------+-------------------------------+

Returned 5 row(s) in 2.39s

[172.16.230.152:21000] > SELECT * FROM tab2 LIMIT 5;

Query: select * FROM tab2 LIMIT 5

+----+-------+-----------+

| id | col_1 | col_2     |

+----+-------+-----------+

| 1  | true  | 12789.123 |

| 2  | false | 1243.5    |

| 3  | false | 24453.325 |

| 4  | false | 2423.3254 |

| 5  | true  | 243.325   |

+----+-------+-----------+

Returned 5 row(s) in 1.30s

[172.16.230.152:21000] > SELECT tab2.*

                       > FROM tab2,

                       > (SELECT tab1.col_1, MAX(tab2.col_2) AS max_col2

                       >  FROM tab2, tab1

                       >  WHERE tab1.id = tab2.id

                       >  GROUP BY col_1) subquery1

                       > WHERE subquery1.max_col2 = tab2.col_2;

Query: select tab2.* FROM tab2, (SELECT tab1.col_1, MAX(tab2.col_2) AS max_col2 FROM tab2, tab1 WHERE tab1.id = tab2.id GROUP BY col_1) subquery1 WHERE subquery1.max_col2 = tab2.col_2

+----+-------+-----------+

| id | col_1 | col_2     |

+----+-------+-----------+

| 1  | true  | 12789.123 |

| 3  | false | 24453.325 |

+----+-------+-----------+

Returned 2 row(s) in 1.02s

[172.16.230.152:21000] > quit;

Goodbye

[cloudera@h1 testdata]$
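The same queries can also be run without entering the interactive shell. The sketch below assumes an impala-shell version that supports the -q (query), -B (delimited output), and --output_delimiter options; the function name run_query and the IMPALA_SHELL and IMPALAD_HOST variables are hypothetical names introduced here, and the host address is the one used throughout this article.

```shell
#!/usr/bin/env bash
# Client command and impalad host. Both variables are assumptions made
# for this sketch so they can be overridden without editing the function.
IMPALA_SHELL="${IMPALA_SHELL:-impala-shell}"
IMPALAD_HOST="${IMPALAD_HOST:-172.16.230.152}"

# Run a single SQL statement and print the result as comma-separated
# text instead of the boxed table drawn by the interactive shell.
run_query() {
    "$IMPALA_SHELL" -i "$IMPALAD_HOST" -B --output_delimiter=',' -q "$1"
}

# Usage (needs a reachable impalad):
#   run_query 'SELECT * FROM tab2 LIMIT 5'
```

Delimited output of this kind is easier to feed into other command-line tools than the pretty-printed tables shown above.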

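Conclusion:

This article walked through a basic example of using Impala and is meant as an introductory guide. For more examples, see: Impala Tutorial

This article used several impala-shell commands; for details, see Using the Impala Shell (impala-shell Command)

Original work; please credit the source when reposting: http://blog.csdn.net/yangzhaohui168/article/details/35340387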