A Small Definition of Big Data

The term "big data" seems to be popping up everywhere these days. And there seems to be as many uses of this term as there are contexts in which you find it: 'big data' is often used to refer to any dataset that is difficult to manage using traditional database systems; it is also used as a catch-all term for any collection of data that is too large to process on a single server; yet others use the term to simply eman "a lot of data"; sometimes it turns out it doesn't even have to be large. So what exactly is big data?

A precise specification of 'big' is elusive. What is considered big for one organization may be small for another. What is large-scale today will likely seem small-scale in the near future; the petabyte is the new terabyte. Thus, size alone cannot specify big data. The complexity of the data is an important factor that must also be considered.

Most now agree with the characterization of big data using the 3 V's coined by Doug Laney of Gartner:

  • Volume: This refers to the vast amounts of data that are generated every second, minute, hour, and day in our digitized world.
  • Velocity: This refers to the speed at which data is being generated and the pace at which data moves from one point to the next.
  • Variety: This refers to the ever-increasing different forms that data can come in, e.g., text, images, voice, geospatial.

A fourth V is now also sometimes added:

  • Veracity: This refers to the quality of the data, which can vary greatly.

These V's are the dimensions that characterize big data, and they also embody its challenges: we have huge amounts of data, in different formats and of varying quality, that must be processed quickly.

It is important to note that the goal of processing big data is to gain insight to support decision-making. It is not sufficient to just be able to capture and store the data. The point of collecting and processing volumes of complex data is to understand trends, uncover hidden patterns, detect anomalies, and so on, so that you have a better understanding of the problem being analyzed and can make more informed, data-driven decisions. In fact, many consider value the fifth V of big data:

  • Value: Processing big data must bring about value from the insights gained.

To address the challenges of big data, innovative technologies are needed. Parallel, distributed computing paradigms, scalable machine learning algorithms, and real-time querying are key to the analysis of big data. Distributed file systems, computing clusters, cloud computing, and data stores supporting data variety and agility are also necessary to provide the infrastructure for processing big data. Workflows provide an intuitive, reusable, scalable, and reproducible way to process big data, gain verifiable value from it, and apply the same methods to different datasets.

With all the data generated by social media, smart sensors, satellites, surveillance cameras, the Internet, and countless other devices, big data is all around us. The endeavor to make sense of that data brings about exciting opportunities indeed!

Data Science

Data Science is about extracting knowledge from data. At the WorkDS Center, we define data science as a multidisciplinary craft that combines people, process, computational and Big Data platforms, application-specific purpose and programmability. Publications and provenance of the data products leading to these publications are also important for data science.

  • People: Data scientists are often seen as people who possess skills in a variety of areas, including science or business domain knowledge; analysis using statistics, machine learning, and mathematics; and data management, programming, and computing. In practice, this is generally a group of researchers with complementary skills.
  • Process: The process of data science includes techniques from statistics, machine learning, programming, computing, and data management. Data science workflows combine such steps into executable graphs. We believe that process-oriented thinking is a transformative way of conducting data science that connects people and techniques to applications. Challenges for the data science process include: 1) how to easily integrate all the needed tasks to build such a process; and 2) how to find the best computing resources and efficiently schedule process executions on those resources based on process definitions, parameter settings, and user preferences.
  • Purpose: Purpose comes when people use generalizable processes with a particular goal in mind. The purpose can be related to a scientific analysis with a hypothesis or to a business metric that needs to be analyzed, often based on Big Data. Note that similar reusable processes can be applicable to many applications with different purposes when employed within different workflows.
  • Platforms: Based on the needs of an application-driven purpose and the amount of data and computing required to perform it, different computing and data platforms can be used as part of the data science process. This scalability should be made part of any data science solution architecture.
  • Programmability: Capturing a scalable data science process requires aid from programming languages, e.g., R, and patterns, e.g., MapReduce. Tools that provide access to such programming techniques are key to making the data science process programmable on a variety of platforms.

Execution of such a data science process requires access to many datasets, big and small, bringing new opportunities and challenges to Data Science. There are many Data Science steps or tasks, such as Data Collection, Data Cleaning, Data Processing/Analysis, and Result Visualization, which together form a Data Science Workflow. Data Science Processes may need user interaction and other manual operations, or they may be fully automated.
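To make this concrete, here is a minimal Python sketch of a data science workflow as a simple chain of such steps. The step names and the sample data are hypothetical placeholders, not part of any specific workflow system.

    # A minimal sketch: a data science workflow as a chain of steps.
    # Step names (collect, clean, analyze, visualize) and data are made up.

    def collect():
        """Data Collection: here, just a hard-coded sample of sensor readings."""
        return [12.0, None, 15.5, 14.2, None, 13.8]

    def clean(raw):
        """Data Cleaning: drop missing values."""
        return [x for x in raw if x is not None]

    def analyze(values):
        """Data Processing/Analysis: compute a simple summary statistic."""
        return {"count": len(values), "mean": sum(values) / len(values)}

    def visualize(summary):
        """Result Visualization: print the summary (a stand-in for a real plot)."""
        print(f"n={summary['count']}, mean={summary['mean']:.2f}")

    # The workflow itself is just the composition of the steps.
    visualize(analyze(clean(collect())))

In a real setting each step might be a separate tool or service running on a different platform; a workflow system connects the steps into an executable graph rather than a single script.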

MapReduce

MapReduce is a scalable programming model that simplifies distributed processing of data. It consists of three main steps: Map, Shuffle, and Reduce. An easy way to think about a MapReduce job is to compare it with the act of 'delegating' a large task to a group of people and then combining the results of each person's effort to produce the final outcome.

Let's take an example to bring the point across. You just heard some great news at your office and are throwing a party for all your colleagues! You decide to cook pasta for the dinner. Four of your friends, who like cooking, volunteer to join you in the preparation. The task of preparing pasta broadly involves chopping the vegetables, cooking, and garnishing.

Let's take the job of chopping the vegetables and see how it is analogous to a MapReduce task. Here the raw vegetables are symbolic of the input data, your friends are equivalent to compute nodes, and the final chopped vegetables are analogous to the desired outcome. Each friend is allotted onions, tomatoes, and peppers to chop and weigh.

You would also like to know how much of each vegetable type you have in the kitchen, and you would like the vegetables to be chopped while this calculation is going on. In the end, the onions should be in one large bowl with a label that displays their weight in pounds, the tomatoes in a separate one, and so on.

MAP: To start with, you assign each of your four friends a random mix of the different types of vegetables. They are required to use their 'compute' powers to chop them and measure the weight of each type of veggie. They need to ensure they do not mix different types of veggies. So each friend will generate a mapping of <key, value> pairs that looks like:

Friend X:

<tomatoes, 5lbs>

<onions, 10lbs>

<garlic, 2lbs>

Friend Y:

<onions, 22lbs>

<green peppers, 5lbs>

……
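In code, the Map stage of this analogy might look like the following minimal Python sketch. The record format and the weights are made up for illustration; a real framework such as Hadoop would run each mapper on a different node.

    # MAP: each friend (mapper) turns raw records into <vegetable, weight> pairs.
    # The record format "vegetable weight_in_lbs" and all weights are made up.

    def map_record(record):
        """Chop and weigh: parse one raw record into a (vegetable, weight) pair."""
        veggie, weight = record.rsplit(" ", 1)
        return veggie, float(weight)

    friend_x_batch = ["tomatoes 5", "onions 10", "garlic 2"]
    friend_y_batch = ["onions 22", "green peppers 5"]

    mapped = [map_record(r) for r in friend_x_batch + friend_y_batch]
    print(mapped)
    # [('tomatoes', 5.0), ('onions', 10.0), ('garlic', 2.0), ('onions', 22.0), ('green peppers', 5.0)]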

Seems like you are having a really big party! Now that your friends have chopped the vegetables, and labeled each bowl with the weight and type of vegetable, we move to the next stage: Shuffling.

SHUFFLE: This stage is also called Grouping. Here you want to group the veggies by their type. You assign a different part of your kitchen to each type of veggie, and your friends are supposed to group the bowls so that like items are placed together.

North End of Kitchen:

<tomatoes, 5lbs>

<tomatoes, 11lbs>

West End of Kitchen:

<onions, 10lbs>

<onions, 22lbs>

<onions, 1.4lbs>

East End of Kitchen:

<green peppers, 3lbs>

<green peppers, 10lbs>
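In code, the Shuffle stage is simply a grouping of the mapped pairs by key. This sketch uses made-up pairs matching the example above.

    # SHUFFLE/GROUP: gather all weights for the same vegetable together,
    # like assigning each vegetable type its own corner of the kitchen.
    from collections import defaultdict

    mapped_pairs = [("tomatoes", 5), ("onions", 10), ("tomatoes", 11),
                    ("onions", 22), ("onions", 1.4),
                    ("green peppers", 3), ("green peppers", 10)]

    grouped = defaultdict(list)
    for veggie, weight in mapped_pairs:
        grouped[veggie].append(weight)

    print(dict(grouped))
    # {'tomatoes': [5, 11], 'onions': [10, 22, 1.4], 'green peppers': [3, 10]}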

The party starts in a couple of hours, but you are impressed by what your friends have accomplished through Mapping and Grouping so far! The kitchen looks much more organized now, and the raw ingredients are chopped. The final stage of this task is to measure how much of each veggie you actually have. This brings us to the Reduce stage.

REDUCE: In this stage, you ask each of your friends to collect the bowls of the same type, put their contents into one large bowl, and label this large bowl with the sum of the individual bowl weights. Your friends cannot wait for the party to start and immediately begin 'reducing' the small bowls. In the end, you have nice large bowls, each labeled with the total weight of its vegetable.
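In code, the Reduce stage sums the weights within each group. The grouped input below is made-up sample data matching the kitchen example.

    # REDUCE: for each vegetable, sum the weights of its small bowls and
    # label one large bowl with the total. Input is made-up sample data.
    grouped = {"tomatoes": [5, 11], "onions": [10, 22, 1.4], "green peppers": [3, 10]}

    def reduce_group(veggie, weights):
        """Produce the final <vegetable, total weight> pair for one group."""
        return veggie, sum(weights)

    totals = dict(reduce_group(v, w) for v, w in grouped.items())
    print(totals)
    # {'tomatoes': 16, 'onions': 33.4, 'green peppers': 13}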

Your friends ('compute nodes') just performed a MapReduce task to help you get started with cooking the pasta. Since you were coordinating the entire exercise, you are the 'Master' node of this MapReduce task. Each of your friends took on the roles of Mapper, Grouper, and Reducer at different times. This example demonstrates the power of the technique.

This simple and powerful technique can be scaled very easily if more of your friends decide to join you.
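Putting the three stages together gives a complete, single-machine sketch of the MapReduce flow from this party example. All data is made up and everything runs in one Python process; a real framework such as Hadoop would distribute the mappers and reducers across a cluster, and scaling out is as simple as adding more 'friends' (nodes).

    # Full pipeline: MAP -> SHUFFLE -> REDUCE on one machine, for illustration only.
    from collections import defaultdict

    def map_record(record):
        """MAP: parse one raw record 'vegetable weight_lbs' into a <key, value> pair."""
        veggie, weight = record.rsplit(" ", 1)
        return veggie, float(weight)

    def shuffle(pairs):
        """SHUFFLE: group all weights belonging to the same vegetable."""
        grouped = defaultdict(list)
        for veggie, weight in pairs:
            grouped[veggie].append(weight)
        return grouped

    def reduce_group(veggie, weights):
        """REDUCE: one large bowl per vegetable, labeled with the total weight."""
        return veggie, sum(weights)

    records = ["tomatoes 5", "onions 10", "garlic 2",
               "onions 22", "green peppers 5", "tomatoes 11",
               "onions 1.4", "green peppers 3", "green peppers 10"]

    mapped = [map_record(r) for r in records]      # every friend chops and weighs
    totals = dict(reduce_group(v, w) for v, w in shuffle(mapped).items())
    print(totals)
    # {'tomatoes': 16.0, 'onions': 33.4, 'garlic': 2.0, 'green peppers': 18.0}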
