Amazon Redshift and Massively Parellel Processing
Today, Yelp held a tech talk in Columbia University about the data warehouse adopted by Yelp.
Yelp used Amazon Redshift as data warehouse.
There are several features for Redshift:
1. Massively Parellel Processing
2. SQL access
3. Column-based Datastore
Benefits are:
1. Data is structured, accessible and well documented.
2. Architecture allows for easy extensibility and sharing across teams.
3. Allows use of entire SQL-compatible tool ecosystem.
Details:
Massively Parellel Processing (MMP)
Traditional BigData always uses Hadoop + MapReduce. MapReduce's native control mechanism is Java code (to implement the Map and Reduce logic), whereas MPP products are queried with SQL(Structural Query Language). You can refer detail here.
Below is the structure for implementing MMP.

Similarly, Data is distributed across each segment database to achieve data and processing parallelism. This is achieved by creating a database table with DISTRIBUTED BY clause. By using this clause data is automatically distributed across segment databases. (referrence: Introduction to MMP)
Typical query sentence in MMP

Column-based Datastore
Enables sparse table definitions
Enables compact storage
Improve scanning/filtering
(Benefits: wiki)
Column-based Datastore
- Column-oriented organizations are more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data, because reading that smaller subset of data can be faster than reading all data.
- Column-oriented organizations are more efficient when new values of a column are supplied for all rows at once, because that column data can be written efficiently and replace old column data without touching any other columns for the rows.
- Row-oriented organizations are more efficient when many columns of a single row are required at the same time, and when row-size is relatively small, as the entire row can be retrieved with a single disk seek.
- Row-oriented organizations are more efficient when writing a new row if all of the row data is supplied at the same time, as the entire row can be written with a single disk seek.
In practice, row-oriented storage layouts are well-suited for OLTP-like workloads which are more heavily loaded with interactive transactions. Column-oriented storage layouts are well-suited for OLAP-like workloads (e.g., data warehouses) which typically involve a smaller number of highly complex queries over all data (possibly terabytes).
Amazon Redshift and Massively Parellel Processing的更多相关文章
- Amazon Redshift数据库
Amazon Redshift介绍 Amazon Redshift是一种可轻松扩展的完全托管型PB级数据仓库,它通过使用列存储技术和并行化多个节点的查询来提供快速的查询性能,使您能够更高效的分析现有数 ...
- Power BI连接至Amazon Redshift
一直在使用Power BI连接至MongoDB中,但效果一直不是太理想,今天使用另一种方法,将MongoDB中的数据通过Azure Data Factory转入Amazon Redshift中,而在P ...
- amazon redshift 分析型数据库特点——本质还是列存储
Amazon Redshift 是一种快速且完全托管的 PB 级数据仓库,使您可以使用现有的商业智能工具经济高效地轻松分析您的所有数据.从最低 0.25 USD 每小时 (不承担任何义务) 直到每年每 ...
- Amazon Redshift数据迁移到MaxCompute
Amazon Redshift数据迁移到MaxCompute Amazon Redshift 中的数据迁移到MaxCompute中经常需要先卸载到S3中,再到阿里云对象存储OSS中,大数据计算服务Ma ...
- POWER BI 基于 ODBC 数据源的配置刷新-以Amazon Redshift为例
POWER BI 基于 ODBC 数据源的配置刷新-以Amazon Redshift为例 Powerbi 有多种数据源连接,可以使用它们连接到不同数据源. 如果在 Power BI Desktop 的 ...
- Amazon Redshift and the Case for Simpler Data Warehouses
Redshift是Amazon一个商业产品上的进化 但并不是技术的进化,他使用的无非都是传统数仓领域的技术 如果说创新,就是大量使用Amazon本身的云服务的云原生架构,大大提升的产品的迭代速度,可维 ...
- Python 如何连接并操作 Aws 上 PB 级云数据仓库 Redshift
Python 如何连接并操作 Aws 上 PB 级云数据仓库 Redshift 一.简介 Amazon Redshift 是一个快速.可扩展的数据仓库,可以简单.经济高效地分析数据仓库和数据湖中的所有 ...
- Qwiklab'实验-DynamoDB, Redshift, Elasticsearch'
title: AWS之Qwiklab subtitle: 4. Qwiklab'实验-Amazon DynamoDB, Amazon Redshift, Elasticsearch Service' ...
- Massively parallel supercomputer
A novel massively parallel supercomputer of hundreds of teraOPS-scale includes node architectures ba ...
随机推荐
- 【POJ3299】Humidex(简单的数学推导)
公式题中已经给出,直接求解即可. #include <iostream> #include <cstdlib> #include <cstdio> #include ...
- linux group
groups 查看当前登录用户的组内成员 groups gliethttp 查看gliethttp用户所在的组,以及组内成员 whoami 查看当前登录用户名 /etc/group文件包含所有组 ...
- Java中ThreadLocal无锁化线程封闭实现原理
虽然现在可以说很多程序员会用ThreadLocal,但是我相信大多数程序员还不知道ThreadLocal,而使用ThreadLocal的程序员大多只是知道其然而不知其所以然,因此,使用ThreadLo ...
- java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
此方法为Timestamp的 转换方法. 这几天做到excel导入功能,其中里面有几个时间时段,所以用了这个类来将导入的字符串格式转换Timestamp格式. 不慎出现了 java.lang.Ille ...
- [Git] set-upstream
When you want to push your local branch to remote branch, for the first push: git push --set-upstrea ...
- Android应用程序资源的查找过程分析
文章转载至CSDN社区罗升阳的安卓之旅,原文地址:http://blog.csdn.net/luoshengyang/article/details/8806798 我们知道,在Android系统中, ...
- 内外连接、组函数、DDL、DML和TCL
前言 cross join ,是笛卡尔积:nature join 是自然连接. 正文 内外连接 inner join inner join 的inner能够省略. 内连接 在一个表中可以找到在还有一个 ...
- 国内外移动端web适配屏幕方案
基础知识点 设备像素:设备像素又称物理像素(physical pixel),设备能控制显示的最小单位,我们可以把这些像素看作成显示器上一个个的点. iPhone5的物理像素是640X1136. PS: ...
- javascript无缝流畅动画轮播,终于让我给搞出来了。
自己一直想写一个真正能用的轮播图,以前是写过一个,但是不是无缝的轮播,感觉体验很差,这个轮播之前也搞了很多实例,看了很多代码,但是脑子总转不过弯,为什么在运动到一定距离后可以突然转回到原始位置,而没有 ...
- js中的eval方法转换对象时,为何一定要加上括号?
待转换的是一个Json字符串: {'name':'新欢'} 而使用如下这种方式调用则会抛出语法异常, eval("{'name':'新欢'}"); 必须加上括号才行 eval(&q ...