Querying and Inserting Data

The Hive query operations are documented in Select, and the insert operations are documented in Inserting data into Hive Tables from queries and Writing data into the filesystem from queries.

Simple Query

To retrieve all the active users, one can use a query of the following form:

INSERT OVERWRITE TABLE user_active
SELECT user.*
FROM user
WHERE user.active = 1;

Note that unlike SQL, we always insert the results into a table. We will illustrate later how the user can inspect these results and even dump them to a local file. You can also run the following query on Hive CLI:

SELECT user.*
FROM user
WHERE user.active = 1;

Internally, the results of this query are written to a temporary file and displayed on the Hive client side.

Partition Based Query

What partitions to use in a query is determined automatically by the system on the basis of where clause conditions on partition columns. For example, in order to get all the page_views in the month of 03/2008 referred from domain xyz.com, one could write the following query:

INSERT OVERWRITE TABLE xyz_com_page_views
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31' AND
      page_views.referrer_url like '%xyz.com';

Note that page_views.date is used here because the table (above) was defined with PARTITIONED BY(date DATETIME, country STRING); if you name your partition column something different, don't expect .date to do what you think!

Joins

In order to get a demographic breakdown (by gender) of the page_view data for 2008-03-03, one would need to join the page_view table and the user table on the userid column. This can be accomplished with a join as shown in the following query:

INSERT OVERWRITE TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM user u JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';

In order to do outer joins the user can qualify the join with LEFT OUTER, RIGHT OUTER or FULL OUTER keywords in order to indicate the kind of outer join (left preserved, right preserved or both sides preserved). For example, in order to do a full outer join in the query above, the corresponding syntax would look like the following query:

INSERT OVERWRITE TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM user u FULL OUTER JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';

In order to check the existence of a key in another table, the user can use LEFT SEMI JOIN, as illustrated by the following example.

INSERT OVERWRITE TABLE pv_users
SELECT u.*
FROM user u LEFT SEMI JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';

In order to join more than one table, the user can use the following syntax:

INSERT OVERWRITE TABLE pv_friends
SELECT pv.*, u.gender, u.age, f.friends
FROM page_view pv JOIN user u ON (pv.userid = u.id) JOIN friend_list f ON (u.id = f.uid)
WHERE pv.date = '2008-03-03';

Note that Hive only supports equi-joins. Also, it is best to put the largest table on the rightmost side of the join to get the best performance.
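
If rewriting the FROM clause to put the largest table last is inconvenient, Hive's join documentation also describes a STREAMTABLE hint for marking which table should be streamed through the reducers. The following is only a hedged sketch applied to the earlier join, not part of the original tutorial; check that the hint is supported in your Hive version:

INSERT OVERWRITE TABLE pv_users
SELECT /*+ STREAMTABLE(pv) */ pv.*, u.gender, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';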

Aggregations

In order to count the number of distinct users by gender one could write the following query:

INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count (DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;

Multiple aggregations can be done at the same time; however, no two aggregations can have different DISTINCT columns. For example, while the following is possible:

INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;

however, the following query is not allowed:

INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip)
FROM pv_users
GROUP BY pv_users.gender;
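
If both distinct counts are needed, a common workaround is to compute each DISTINCT aggregate in its own subquery and join the per-gender results. The following is only a sketch, not taken from the original tutorial, and it assumes pv_gender_agg has a matching three-column schema:

INSERT OVERWRITE TABLE pv_gender_agg
SELECT u.gender, u.user_cnt, i.ip_cnt
FROM (SELECT pv_users.gender, count(DISTINCT pv_users.userid) AS user_cnt
      FROM pv_users
      GROUP BY pv_users.gender) u
JOIN (SELECT pv_users.gender, count(DISTINCT pv_users.ip) AS ip_cnt
      FROM pv_users
      GROUP BY pv_users.gender) i
ON (u.gender = i.gender);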

Multi Table/File Inserts

The output of the aggregations or simple selects can be further sent into multiple tables or even to hadoop dfs files (which can then be manipulated using hdfs utilities). e.g. if along with the gender breakdown, one needed to find the breakdown of unique page views by age, one could accomplish that with the following query:

FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
    SELECT pv_users.gender, count(DISTINCT pv_users.userid)
    GROUP BY pv_users.gender
 
INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
    SELECT pv_users.age, count(DISTINCT pv_users.userid)
    GROUP BY pv_users.age;

The first insert clause sends the results of the first group by to a Hive table, while the second one sends the results to HDFS files.
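
The files written to that directory can then be inspected with HDFS utilities. A minimal sketch from the Hive CLI (the dfs command passes through to the Hadoop filesystem shell; the field delimiter in the output depends on the serialization settings):

dfs -cat /user/data/tmp/pv_age_sum/*;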

Dynamic-Partition Insert

In the previous examples, the user has to know which partition to insert into, and only one partition can be inserted by one insert statement. If you want to load into multiple partitions, you have to use a multi-insert statement, as illustrated below.

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'US'
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='CA')
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'CA'
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='UK')
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'UK';

In order to load data into all country partitions on a particular day, you have to add an insert statement for each country in the input data. This is very inconvenient since you need prior knowledge of which countries exist in the input data, and you have to create the partitions beforehand. If the list changes for another day, you have to modify your insert DML as well as the partition creation DDLs. It is also inefficient since each insert statement may be turned into a MapReduce job.

Dynamic-partition insert (or multi-partition insert) is designed to solve this problem by dynamically determining which partitions should be created and populated while scanning the input table. This is a newly added feature that is only available from version 0.6.0. In a dynamic partition insert, the input column values are evaluated to determine which partition this row should be inserted into. If that partition has not been created, it will be created automatically. Using this feature you need only one insert statement to create and populate all necessary partitions. In addition, since there is only one insert statement, there is only one corresponding MapReduce job. This significantly improves performance and reduces the Hadoop cluster workload compared to the multiple-insert case.

Below is an example of loading data to all country partitions using one insert statement:

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country

There are several syntactic differences from the multi-insert statement:

  • country appears in the PARTITION specification, but with no value associated. In this case, country is a dynamic partition column. On the other hand, dt has a value associated with it, which means it is a static partition column. If a column is a dynamic partition column, its value will come from the input column. Currently we only allow dynamic partition columns to be the last column(s) in the partition clause, because the partition column order indicates its hierarchical order (meaning dt is the root partition, and country is the child partition). You cannot specify a partition clause such as (dt, country='US'), because that would mean updating the country='US' sub-partition under every date partition; a sketch of such a rejected statement appears after this list.
  • An additional pvs.country column is added in the select statement. This is the corresponding input column for the dynamic partition column. Note that you do not need to add an input column for the static partition column because its value is already known in the PARTITION clause. Note that the dynamic partition values are selected by ordering, not name, and taken as the last columns from the select clause.
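
For illustration, the following statement would be rejected under the ordering rule above, because the dynamic column (dt) precedes the static one (country). This is only a sketch reusing the page_view_stg columns from the earlier examples; the dt value would have to come from an input expression, as in the troubleshooting example below:

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt, country='US')
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip,
              from_unixtime(pvs.viewTime, 'yyyy-MM-dd')
       WHERE pvs.country = 'US';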

Semantics of the dynamic partition insert statement:

  • When non-empty partitions already exist for the dynamic partition columns (e.g., country='CA' exists under some dt root partition), they will be overwritten if the dynamic partition insert sees the same value (say 'CA') in the input data. This is in line with the 'insert overwrite' semantics. However, if the partition value 'CA' does not appear in the input data, the existing partition will not be overwritten.
  • Since a Hive partition corresponds to a directory in HDFS, the partition value has to conform to the HDFS path format (URI in Java). Any character having a special meaning in URI (e.g., '%', ':', '/', '#') will be escaped with '%' followed by 2 bytes of its ASCII value.
  • If the input column is a type different than STRING, its value will be first converted to STRING to be used to construct the HDFS path.
  • If the input column value is NULL or an empty string, the row will be put into a special partition whose name is controlled by the hive parameter hive.exec.default.partition.name. The default value is __HIVE_DEFAULT_PARTITION__. Basically this partition will contain all "bad" rows whose values are not valid partition names. The caveat of this approach is that the bad value is lost and replaced by __HIVE_DEFAULT_PARTITION__ if you select them in Hive. JIRA HIVE-1309 is a solution to let the user specify a "bad file" to retain the input partition column values as well.
  • Dynamic partition insert could potentially be a resource hog in that it could generate a large number of partitions in a short time. To protect against this, we define three parameters (a session-level sketch of setting them, together with the related parameters below, follows this list):
    • hive.exec.max.dynamic.partitions.pernode (default value 100) is the maximum number of dynamic partitions that can be created by each mapper or reducer. If one mapper or reducer creates more than this threshold, a fatal error will be raised from the mapper/reducer (through a counter) and the whole job will be killed.
    • hive.exec.max.dynamic.partitions (default value 1000) is the total number of dynamic partitions that can be created by one DML statement. If no individual mapper/reducer exceeds its limit but the total number of dynamic partitions does, then an exception is raised at the end of the job, before the intermediate data are moved to the final destination.
    • hive.exec.max.created.files (default value 100000) is the maximum total number of files created by all mappers and reducers. This is implemented by each mapper/reducer updating a Hadoop counter whenever a new file is created. If the total exceeds hive.exec.max.created.files, a fatal error will be thrown and the job will be killed.
  • Another situation we want to protect against is the user accidentally specifying all partition columns as dynamic partitions without any static partition, when the original intention is to just overwrite the sub-partitions of one root partition. We define another parameter, hive.exec.dynamic.partition.mode=strict, to prevent the all-dynamic partition case. In strict mode, you have to specify at least one static partition. The default mode is strict. In addition, the parameter hive.exec.dynamic.partition=true/false controls whether to allow dynamic partitions at all; the default value is false.
  • In Hive 0.6, dynamic partition insert does not work with hive.merge.mapfiles=true or hive.merge.mapredfiles=true, so it internally turns off the merge parameters. Merging files in dynamic partition inserts is supported in Hive 0.7 (see JIRA HIVE-1307 for details).
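
The following is a minimal session-level sketch of enabling dynamic partition inserts and adjusting the limits above from the Hive CLI. The limit values shown are just the defaults listed in this section, not tuning recommendations; the first two lines enable the feature and relax strict mode, as also done in the troubleshooting example below:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=100;
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.created.files=100000;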

Troubleshooting and best practices:

  • As stated above, if too many dynamic partitions are created by a particular mapper/reducer, a fatal error will be raised and the job will be killed. The error message looks something like:

        hive> set hive.exec.dynamic.partition.mode=nonstrict;
        hive> FROM page_view_stg pvs
              INSERT OVERWRITE TABLE page_view PARTITION(dt, country)
                     SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip,
                            from_unixtime(pvs.viewTime, 'yyyy-MM-dd') ds, pvs.country;
    ...
    2010-05-07 11:10:19,816 Stage-1 map = 0%,  reduce = 0%
    [Fatal Error] Operator FS_28 (id=41): fatal error. Killing the job.
    Ended Job = job_201005052204_28178 with errors
    ...

    The problem is that one mapper will take a random set of rows, and it is very likely that the number of distinct (dt, country) pairs will exceed the limit of hive.exec.max.dynamic.partitions.pernode. One way around this is to group the rows by the dynamic partition columns in the mapper and distribute them to the reducers where the dynamic partitions will be created. In this case the number of distinct dynamic partitions handled by each mapper/reducer will be significantly reduced. The above example query could be rewritten to:

    hive> set hive.exec.dynamic.partition.mode=nonstrict;
    hive> FROM page_view_stg pvs
          INSERT OVERWRITE TABLE page_view PARTITION(dt, country)
                 SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip,
                        from_unixtime(pvs.viewTime, 'yyyy-MM-dd') ds, pvs.country
                 DISTRIBUTE BY ds, country;

    This query will generate a MapReduce job rather than a map-only job. The SELECT clause will be converted to a plan for the mappers, and the output will be distributed to the reducers based on the value of the (ds, country) pairs. The INSERT clause will be converted to a plan in the reducers which writes to the dynamic partitions.

Inserting into Local Files

In certain situations you may want to write the output to a local file so that you can load it into an Excel spreadsheet. This can be accomplished with the following command:

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pv_gender_sum'
SELECT pv_gender_sum.*
FROM pv_gender_sum;

Sampling

The sampling clause allows the users to write queries for samples of the data instead of the whole table. Currently the sampling is done on the columns that are specified in the CLUSTERED BY clause of the CREATE TABLE statement. In the following example we choose the 3rd bucket out of the 32 buckets of the pv_gender_sum table:

INSERT OVERWRITE TABLE pv_gender_sum_sample
SELECT pv_gender_sum.*
FROM pv_gender_sum TABLESAMPLE(BUCKET 3 OUT OF 32);

In general the TABLESAMPLE syntax looks like:

TABLESAMPLE(BUCKET x OUT OF y)

y has to be a multiple or divisor of the number of buckets in that table as specified at table creation time. The buckets chosen are determined by whether bucket_number modulo y is equal to x. So in the above example the following TABLESAMPLE clause

TABLESAMPLE(BUCKET 3 OUT OF 16)

would pick out the 3rd and 19th buckets. The buckets are numbered starting from 0.

On the other hand the tablesample clause

TABLESAMPLE(BUCKET 3 OUT OF 64 ON userid)

would pick out half of the 3rd bucket.

Union All

The language also supports union all, e.g. if we suppose there are two different tables that track which user has published a video and which user has published a comment, the following query joins the results of a union all with the user table to create a single annotated stream for all the video publishing and comment publishing events:

INSERT OVERWRITE TABLE actions_users
SELECT u.id, actions.date
FROM (
    SELECT av.uid AS uid, av.date AS date
    FROM action_video av
    WHERE av.date = '2008-06-03'
 
    UNION ALL
 
    SELECT ac.uid AS uid, ac.date AS date
    FROM action_comment ac
    WHERE ac.date = '2008-06-03'
    ) actions JOIN users u ON(u.id = actions.uid);

Array Operations

Array columns in tables can be declared as follows:

CREATE TABLE array_table (int_array_column ARRAY<INT>);

Assuming that pv.friends is of the type ARRAY<INT> (i.e. it is an array of integers), the user can get a specific element in the array by its index as shown in the following command:

SELECT pv.friends[2]
FROM page_views pv;

The select expression gets the third item in the pv.friends array.

The user can also get the length of the array using the size function as shown below:

SELECT pv.userid, size(pv.friends)
FROM page_view pv;

Map (Associative Arrays) Operations

Maps provide collections similar to associative arrays. Such structures can only be created programmatically currently. We will be extending this soon. For the purpose of the current example assume that pv.properties is of the type map<String, String>, i.e. it is an associative array from strings to strings. Accordingly, the following query:

INSERT OVERWRITE TABLE page_views_map
SELECT pv.userid, pv.properties['page type']
FROM page_views pv;

can be used to select the 'page type' property from the page_views table.

Similar to arrays, the size function can also be used to get the number of elements in a map as shown in the following query:

SELECT size(pv.properties)
FROM page_view pv;

Custom Map/Reduce Scripts

Users can also plug in their own custom mappers and reducers in the data stream by using features natively supported in the Hive language. e.g. in order to run a custom mapper script - map_script - and a custom reducer script - reduce_script - the user can issue the following command which uses the TRANSFORM clause to embed the mapper and the reducer scripts.

Note that columns will be transformed to string and delimited by TAB before feeding to the user script, and the standard output of the user script will be treated as TAB-separated string columns. User scripts can output debug information to standard error which will be shown on the task detail page on hadoop.

FROM (
     FROM pv_users
     MAP pv_users.userid, pv_users.date
     USING 'map_script'
     AS dt, uid
     CLUSTER BY dt) map_output
 
 INSERT OVERWRITE TABLE pv_users_reduced
     REDUCE map_output.dt, map_output.uid
     USING 'reduce_script'
     AS date, count;

Sample map script (weekday_mapper.py)

import sys
import datetime

# Read the TAB-separated input columns (userid, unixtime) from Hive, convert
# the Unix timestamp to an ISO weekday number, and emit the output columns as
# a TAB-separated line, since Hive parses the script's standard output
# positionally against the AS clause of the calling query.
for line in sys.stdin:
  line = line.strip()
  userid, unixtime = line.split('\t')
  weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
  print('\t'.join([userid, str(weekday)]))
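
Before the queries above can invoke the script, the file has to be shipped to the cluster. A minimal sketch from the Hive CLI, assuming weekday_mapper.py sits in the current local directory:

ADD FILE weekday_mapper.py;

After the ADD FILE, the MAP ... USING clause would reference the script as 'python weekday_mapper.py' (or simply 'weekday_mapper.py' if the file is executable).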

Of course, both MAP and REDUCE are "syntactic sugar" for the more general select transform. The inner query could also have been written as such:

SELECT TRANSFORM(pv_users.userid, pv_users.date) USING 'map_script' AS dt, uid CLUSTER BY dt FROM pv_users;

Schema-less map/reduce: If there is no "AS" clause after "USING map_script", Hive assumes the output of the script contains 2 parts: a key, which is before the first tab, and a value, which is the rest after the first tab. Note that this is different from specifying "AS key, value", because in that case value will only contain the portion between the first tab and the second tab if there are multiple tabs.

In this way, we allow users to migrate old map/reduce scripts without knowing the schema of the map output. User still needs to know the reduce output schema because that has to match what is in the table that we are inserting to.

FROM (
    FROM pv_users
    MAP pv_users.userid, pv_users.date
    USING 'map_script'
    CLUSTER BY key) map_output
 
INSERT OVERWRITE TABLE pv_users_reduced
 
    REDUCE map_output.key, map_output.value
    USING 'reduce_script'
    AS date, count;

Distribute By and Sort By: Instead of specifying "cluster by", the user can specify "distribute by" and "sort by", so the partition columns and sort columns can be different. The usual case is that the partition columns are a prefix of sort columns, but that is not required.

FROM (
    FROM pv_users
    MAP pv_users.userid, pv_users.date
    USING 'map_script'
    AS c1, c2, c3
    DISTRIBUTE BY c2
    SORT BY c2, c1) map_output
 
INSERT OVERWRITE TABLE pv_users_reduced
 
    REDUCE map_output.c1, map_output.c2, map_output.c3
    USING 'reduce_script'
    AS date, count;

Co-Groups

Amongst the user community using map/reduce, cogroup is a fairly common operation wherein the data from multiple tables are sent to a custom reducer such that the rows are grouped by the values of certain columns on the tables. With the UNION ALL operator and the CLUSTER BY specification, this can be achieved in the Hive query language in the following way. Suppose we wanted to cogroup the rows from the action_video and action_comment tables on the uid column and send them to the 'reduce_script' custom reducer; the following syntax can be used:

FROM (
     FROM (
             FROM action_video av
             SELECT av.uid AS uid, av.id AS id, av.date AS date
 
            UNION ALL
 
             FROM action_comment ac
             SELECT ac.uid AS uid, ac.id AS id, ac.date AS date
     ) union_actions
     SELECT union_actions.uid, union_actions.id, union_actions.date
     CLUSTER BY union_actions.uid) map
 
 INSERT OVERWRITE TABLE actions_reduced
     SELECT TRANSFORM(map.uid, map.id, map.date) USING 'reduce_script' AS (uid, id, reduced_val);
