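I use a public movie dataset from the web, consisting of three files: movies.dat, users.dat, and ratings.dat. They hold roughly 3,000, 6,000, and 1 million records respectively, which is a convenient size for experiments.

First, the data format: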

RATINGS FILE DESCRIPTION
================================================================================
All ratings are contained in the file "ratings.dat" and are in the
following format:

UserID::MovieID::Rating::Timestamp

- UserIDs range between 1 and 6040
- MovieIDs range between 1 and 3952
- Ratings are made on a 5-star scale (whole-star ratings only)
- Timestamp is represented in seconds since the epoch as returned by time(2)
- Each user has at least 20 ratings
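
As a quick check on the format above, one rating line can be parsed in a few lines of Python. This is only a minimal sketch and not part of the original workflow: the helper name parse_rating is hypothetical, and the timestamp is interpreted as UTC by assumption.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: int
    movie_id: int
    rating: int
    rated_at: datetime  # timezone-aware, UTC (an assumption of this sketch)


def parse_rating(line: str) -> Rating:
    """Parse one 'UserID::MovieID::Rating::Timestamp' line (hypothetical helper)."""
    user_id, movie_id, rating, ts = line.strip().split("::")
    # The timestamp is seconds since the Unix epoch, as returned by time(2).
    rated_at = datetime.fromtimestamp(int(ts), tz=timezone.utc)
    return Rating(int(user_id), int(movie_id), int(rating), rated_at)


if __name__ == "__main__":
    r = parse_rating("1::1193::5::978300760")
    print(r.user_id, r.movie_id, r.rating, r.rated_at.isoformat())
```

The sample line corresponds to the first row that appears in the Hive query output later in this post.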

USERS FILE DESCRIPTION
================================================================================
User information is in the file "users.dat" and is in the following
format:

UserID::Gender::Age::Occupation::Zip-code

All demographic information is provided voluntarily by the users and is
not checked for accuracy. Only users who have provided some demographic
information are included in this data set.

- Gender is denoted by a "M" for male and "F" for female
- Age is chosen from the following ranges:

* 1: "Under 18"
* 18: "18-24"
* 25: "25-34"
* 35: "35-44"
* 45: "45-49"
* 50: "50-55"
* 56: "56+"

- Occupation is chosen from the following choices:

* 0: "other" or not specified
* 1: "academic/educator"
* 2: "artist"
* 3: "clerical/admin"
* 4: "college/grad student"
* 5: "customer service"
* 6: "doctor/health care"
* 7: "executive/managerial"
* 8: "farmer"
* 9: "homemaker"
* 10: "K-12 student"
* 11: "lawyer"
* 12: "programmer"
* 13: "retired"
* 14: "sales/marketing"
* 15: "scientist"
* 16: "self-employed"
* 17: "technician/engineer"
* 18: "tradesman/craftsman"
* 19: "unemployed"
* 20: "writer"
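
The age and occupation columns are codes, so it can help to decode them when inspecting rows. The sketch below copies the two lookup tables from the description above; the helper name parse_user and the dictionary layout are illustrative assumptions, not something the dataset ships with.

```python
# Lookup tables copied from the users.dat description above.
AGE_GROUPS = {
    1: "Under 18", 18: "18-24", 25: "25-34", 35: "35-44",
    45: "45-49", 50: "50-55", 56: "56+",
}

OCCUPATIONS = [
    "other or not specified", "academic/educator", "artist", "clerical/admin",
    "college/grad student", "customer service", "doctor/health care",
    "executive/managerial", "farmer", "homemaker", "K-12 student", "lawyer",
    "programmer", "retired", "sales/marketing", "scientist", "self-employed",
    "technician/engineer", "tradesman/craftsman", "unemployed", "writer",
]


def parse_user(line: str) -> dict:
    """Parse one 'UserID::Gender::Age::Occupation::Zip-code' line (hypothetical helper)."""
    user_id, gender, age, occupation, zip_code = line.strip().split("::")
    return {
        "user_id": int(user_id),
        "gender": gender,
        "age_group": AGE_GROUPS[int(age)],
        "occupation": OCCUPATIONS[int(occupation)],
        "zip_code": zip_code,  # kept as text so leading zeros survive
    }


if __name__ == "__main__":
    print(parse_user("4::M::45::7::02460"))
```

Keeping the zip code as text matters for values such as 02460, which would lose the leading zero if stored as an integer.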

MOVIES FILE DESCRIPTION
================================================================================

Movie information is in the file "movies.dat" and is in the following
format:

MovieID::Title::Genres

- Titles are identical to titles provided by the IMDB (including
year of release)
- Genres are pipe-separated and are selected from the following genres:

* Action
* Adventure
* Animation
* Children's
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-Noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western
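
Since the genres are pipe-separated inside a single field and the title carries the release year, a movie line needs a little more unpacking than the other two files. The following is a rough sketch; parse_movie is a hypothetical helper, and the sample line is an illustrative row written in the format above rather than one quoted from this post.

```python
import re

# Matches a four-digit year in parentheses at the end of the title.
_YEAR = re.compile(r"\((\d{4})\)\s*$")


def parse_movie(line: str) -> dict:
    """Parse one 'MovieID::Title::Genres' line (hypothetical helper)."""
    movie_id, title, genres = line.rstrip("\r\n").split("::")
    match = _YEAR.search(title)
    return {
        "movie_id": int(movie_id),
        "title": title,
        "year": int(match.group(1)) if match else None,
        "genres": genres.split("|"),  # genres are pipe-separated
    }


if __name__ == "__main__":
    # Illustrative sample line, assumed for this example.
    print(parse_movie("1::Toy Story (1995)::Animation|Children's|Comedy"))
```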

****************************************************************************************************

II. Getting to the point

Create the database and tables:

create database movies;
use movies;
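-- Trial attempts at creating a table. Both fail: Hive declares a column as "name type" (no colon), and the 64-bit integer type is BIGINT, not Long.
CREATE TABLE users(userid:Long);
create table users(userid:Bigint);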
CREATE TABLE ratings(userid Int,movieid Int,rating Int,timestamp Timestamp)PARTITIONED BY(dt String) ROW FORMAT DELIMITED FIELDS TERMINATED BY '::';
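Error: FAILED: ParseException line 1:55 Failed to recognize predicate 'timestamp'. Failed rule: 'identifier' in column specification

timestamp is a reserved keyword in Hive 1.2, so it cannot be used as a column name unless it is quoted with backticks. Rename the column to timestamped.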

CREATE TABLE ratings(userid Int,movieid Int,rating Int,timestamped Timestamp)PARTITIONED BY(dt String) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/home/dyq/Documents/movies/ratings-douhao.dat' into table ratings PARTITION(dt="20161201");
hive> select * from ratings limit 10;
OK
1 1193 5 NULL 20161201
1 661 3 NULL 20161201
1 914 3 NULL 20161201
1 3408 4 NULL 20161201
1 2355 5 NULL 20161201
1 1197 3 NULL 20161201
1 1287 5 NULL 20161201
1 2804 5 NULL 20161201
1 594 4 NULL 20161201
1 919 4 NULL 20161201

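Two things go wrong here. First, "::" causes trouble as a delimiter, because FIELDS TERMINATED BY in ROW FORMAT DELIMITED honors only a single character, so I replaced "::" in the data files with my preferred "," (the -douhao files loaded above; douhao is pinyin for comma). Second, the fourth column is NULL because a TIMESTAMP column in a text file must be written as yyyy-mm-dd hh:mm:ss, while the data holds epoch seconds. Declaring the column as String avoids that: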
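
The post does not show how the -douhao files were produced. One way to do the conversion is sketched below, under these assumptions: the helper names convert_line and convert_file are hypothetical, the files are read as latin-1, and a field that already contains the output delimiter is treated as an error, because a plain "::" to "," replacement would silently shift the columns of such a row (for example, a movie title that itself contains a comma).

```python
def convert_line(line: str, out_delim: str = ",") -> str:
    """Rewrite one '::'-separated line with a single-character delimiter."""
    fields = line.rstrip("\r\n").split("::")
    for field in fields:
        if out_delim in field:
            # Hive would see an extra column here, so refuse instead of guessing.
            raise ValueError(f"field {field!r} already contains {out_delim!r}")
    return out_delim.join(fields)


def convert_file(src: str, dst: str, out_delim: str = ",") -> int:
    """Convert a whole file and return the number of lines written."""
    count = 0
    # latin-1 maps every byte to one character, so bytes round-trip unchanged
    # (the encoding itself is an assumption of this sketch).
    with open(src, encoding="latin-1") as fin, \
         open(dst, "w", encoding="latin-1") as fout:
        for line in fin:
            fout.write(convert_line(line, out_delim) + "\n")
            count += 1
    return count


if __name__ == "__main__":
    print(convert_line("1::1193::5::978300760"))
    # Illustrative title containing a comma: a tab delimiter avoids the clash.
    print(convert_line("11::American President, The (1995)::Comedy|Drama|Romance", "\t"))
```

If a tab is used instead of a comma, the table definition would need FIELDS TERMINATED BY '\t' to match.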

drop table ratings;
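CREATE TABLE ratings(userid Int,movieid Int,rating Int,timestamped String)PARTITIONED BY(dt String) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- Dropping the table also removed its data, so load the file again.
LOAD DATA LOCAL INPATH '/home/dyq/Documents/movies/ratings-douhao.dat' into table ratings PARTITION(dt="20161201");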

hive> select * from ratings limit 10;
OK
1 1193 5 978300760 20161201
1 661 3 978302109 20161201
1 914 3 978301968 20161201
1 3408 4 978300275 20161201
1 2355 5 978824291 20161201
1 1197 3 978302268 20161201
1 1287 5 978302039 20161201
1 2804 5 978300719 20161201
1 594 4 978302268 20161201
1 919 4 978301368 20161201
Time taken: 0.122 seconds, Fetched: 10 row(s)

Everything is OK now. Hive's syntax and type handling really are not very forgiving.

Next, create the movies and users tables.

CREATE TABLE movies(movieid Int,title String,genres String)PARTITIONED BY(dt String) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/home/dyq/Documents/movies/movies-douhao.dat' into table movies PARTITION(dt="20161201");

CREATE TABLE users(userid Int,gender String,age Int,occupation String,zip-code String)PARTITIONED BY(dt String) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

FAILED: ParseException line 1:73 cannot recognize input near '-' 'code' 'String' in column type

A hyphen is not allowed in an unquoted column name, so rename the column to zipcode:

CREATE TABLE users(userid Int,gender String,age Int,occupation String,zipcode String)PARTITIONED BY(dt String) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/home/dyq/Documents/movies/users-douhao.dat' into table users PARTITION(dt="20161201");

hive> select * from users limit 10;
OK
1 F 1 10 48067 20161201
2 M 56 16 70072 20161201
3 M 25 15 55117 20161201
4 M 45 7 02460 20161201
5 M 25 20 55455 20161201
6 F 50 9 55117 20161201
7 M 35 1 06810 20161201
8 M 25 12 11413 20161201
9 M 25 17 61614 20161201
10 F 35 1 95370 20161201
Time taken: 0.168 seconds, Fetched: 10 row(s)

*****************************************************************
Create indexes:

create index ratings_userid_index on table ratings(userid) as 'COMPACT' with deferred rebuild;
show index on ratings;
drop index ratings_userid_index on ratings;

create index ratings_movieid_index on table ratings(movieid) as 'COMPACT' with deferred rebuild;
show index on ratings;
drop index ratings_movieid_index on ratings;

Join before adding the indexes:
select movies.movieid,movies.title,ratings.rating from movies join ratings on(movies.movieid=ratings.movieid);
Time taken: 40.721 seconds, Fetched: 1000209 row(s)

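Join after adding the indexes (same query). Both indexes were created WITH DEFERRED REBUILD, so they stay empty until ALTER INDEX ... REBUILD is run; no rebuild appears in the steps above, which may be why the timing is unchanged: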
Time taken: 40.816 seconds, Fetched: 1000209 row(s)

Querying a single value:
select movies.movieid,movies.title,ratings.rating from movies join ratings on(movies.movieid=ratings.movieid) where movies.movieid=2716;
Time taken: 33.834 seconds, Fetched: 2181 row(s)

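After indexing (as written, the two drop index statements below remove both indexes before the query runs, so this timing may not reflect an indexed query):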
drop index ratings_movieid_index on ratings;
drop index ratings_userid_index on ratings;
select movies.movieid,movies.title,ratings.rating from movies join ratings on(movies.movieid=ratings.movieid) where movies.movieid=2716;

Time taken: 29.428 seconds, Fetched: 2181 row(s)
