Building the Unstructured Data Warehouse: Architecture, Analysis, and Design

Learn essential techniques from data warehouse legend Bill Inmon on how to build the reporting environment your business needs now!

Answers to many valuable business questions hide in text. How well can your existing reporting environment extract the necessary text from email, spreadsheets, and documents, and put it in a useful format for analytics and reporting? Transforming the traditional data warehouse into an efficient unstructured data warehouse requires additional skills from the analyst, architect, designer, and developer. This book prepares you to implement an unstructured data warehouse. Through clear explanations, examples, and case studies, you will learn new techniques and tips for obtaining and analyzing text.

Master these ten objectives:

  • Build an unstructured data warehouse using the 11-step approach
  • Integrate text and describe it in terms of homogeneity, relevance, medium, volume, and structure
  • Overcome challenges including blather, the Tower of Babel, and lack of natural relationships
  • Avoid the Data Junkyard and combat the Spider's Web
  • Reuse techniques perfected in the traditional data warehouse and Data Warehouse 2.0, including iterative development
  • Apply essential techniques for textual Extract, Transform, and Load (ETL) such as phrase recognition, stop word filtering, and synonym replacement
  • Design the Document Inventory system and link unstructured text to structured data
  • Leverage indexes for efficient text analysis and taxonomies for useful external categorization
  • Manage large volumes of data using advanced techniques such as backward pointers
  • Evaluate technology choices suitable for unstructured data processing, such as data warehouse appliances
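
The textual ETL techniques named in the list above (phrase recognition, stop word filtering, and synonym replacement) can be illustrated with a minimal Python sketch. This is not the book's implementation: the lookup tables `STOP_WORDS`, `SYNONYMS`, and `PHRASES` and the function names are hypothetical, and a real project would curate such lists per business domain.

```python
import re

# Hypothetical lookup tables, assumed for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}
SYNONYMS = {"purchase": "buy", "bought": "buy", "client": "customer"}
PHRASES = {("data", "warehouse"): "data_warehouse"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def recognize_phrases(tokens):
    """Merge known two-word phrases into a single token."""
    out = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PHRASES:
            out.append(PHRASES[pair])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def textual_etl(text):
    """Phrase recognition, then stop word filtering, then synonym replacement."""
    tokens = recognize_phrases(tokenize(text))
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [SYNONYMS.get(t, t) for t in tokens]


print(textual_etl("The client bought a data warehouse"))
# ['customer', 'buy', 'data_warehouse']
```

Phrase recognition runs first in this sketch so that a phrase is not broken apart if one of its words also appears in the stop word list.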

The following outline briefly describes each chapter's content:

  • Chapter 1 defines unstructured data and explains why text is the main focus of this book.
  • Chapter 2 addresses the challenges one faces when managing unstructured data.
  • Chapter 3 discusses the DW 2.0 architecture, which leads into the role of the unstructured data warehouse. It defines the unstructured data warehouse and lists its benefits. Several features of the conventional data warehouse can be leveraged for the unstructured data warehouse, including ETL processing, textual integration, and iterative development.
  • Chapter 4 focuses on the heart of the unstructured data warehouse: Textual Extract, Transform, and Load (ETL).
  • Chapter 5 describes the 11 steps required to develop the unstructured data warehouse.
  • Chapter 6 describes how to inventory documents for maximum analysis value, as well as link the unstructured text to structured data for even greater value.
  • Chapter 7 covers each type of index needed to make text analysis efficient. These range from simple indexes, which are fast to create and work well when the analyst knows what needs to be analyzed before indexing begins, to complex combined indexes, which can be built from any or all of the other kinds of indexes.
  • Chapter 8 explains taxonomies and how they can be used within the unstructured data warehouse.
  • Chapter 9 explains ways of coping with large amounts of unstructured data. Techniques such as keeping the unstructured data at its source and using backward pointers are discussed. The chapter explains why iterative development is so important.
  • Chapter 10 focuses on challenges and some technology choices that are suitable for unstructured data processing. In addition, the data warehouse appliance is discussed.
  • Chapters 11, 12, and 13 put all of the previously discussed techniques and approaches in context through three case studies.
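
As a rough illustration of the simple index from Chapter 7 combined with the backward pointers from Chapter 9, the Python sketch below indexes a predetermined list of terms and stores only a pointer back to each occurrence, leaving the text at its source. It reflects one reading of those terms and is not taken from the book; `BackwardPointer`, `build_index`, and the sample document path are hypothetical.

```python
import re
from collections import defaultdict
from typing import NamedTuple


class BackwardPointer(NamedTuple):
    """Where a term occurs, so the text itself can stay at its source."""
    source: str  # hypothetical document location, e.g. a file path
    offset: int  # character offset of the term within that document


def build_index(documents, terms):
    """Index only the terms chosen up front, storing pointers instead of text."""
    wanted = {t.lower() for t in terms}
    index = defaultdict(list)
    for source, text in documents.items():
        for match in re.finditer(r"\w+", text):
            word = match.group().lower()
            if word in wanted:
                index[word].append(BackwardPointer(source, match.start()))
    return dict(index)


docs = {"mail/001.txt": "Refund requested. Refund denied."}
index = build_index(docs, ["refund"])
for pointer in index["refund"]:
    print(pointer.source, pointer.offset)
# mail/001.txt 0
# mail/001.txt 18
```

Because the index holds only locations, its size grows with the number of matching terms and not with the volume of the source documents.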
