Data analysis system
A data analysis system, particularly, a system capable of efficiently analyzing big data is provided. The data analysis system includes an analyst server, at least one data storage unit, a client terminal independent of the analyst server, and a caching device independent of the analyst server. The caching device includes a caching memory, a data transmission interface, and a controller for obtaining a data access pattern of the client terminal with respect to the at least one data storage unit, performing caching operations on the at least one data storage unit according to a caching criterion to obtain and store cache data in the caching memory, and sending the cache data to the analyst server via the data transmission interface, such that the analyst server analyzes the cache data to generate an analysis result, which may be used to request a change in the caching criterion.
BACKGROUND
1. Field of the Invention
The present invention relates to data analysis systems, and more particularly, to a system for analyzing big data according to caching criteria of a caching device.
2. Background of the Related Art
With information devices in wide use, data sources are becoming more abundant. In addition to conventional manual input and system computation, data is generated at every moment as a result of the Internet, the emergence of cloud computing, the rapid development of mobile computing and the Internet of Things (IoT), and ubiquitous mobile devices, RFID, and wireless sensors.
Big data cannot work by itself. A large storage unit is required to provide sufficient data storage space. A caching device, especially a solid-state storage device, typically stores data replicas in the large storage unit (for example, a hard disk drive) to speed up data access of the system.
BRIEF SUMMARY
One embodiment of the present invention provides a data analysis system comprising an analyst server, at least one data storage unit, a client terminal independent of the analyst server, and a caching device independent of the analyst server. The caching device comprises a cache memory, a data transmission interface, and a controller in communication with the analyst server, the client terminal, and the storage unit. The controller obtains a data access pattern of the client terminal with respect to the storage unit and performs caching operations on the storage unit according to a caching criterion to obtain and store cache data in the cache memory and send the cache data to the analyst server via the data transmission interface, thereby allowing the analyst server to analyze the cache data and generate an analysis result.
Another embodiment of the present invention provides a caching device comprising a cache memory, a data transmission interface, and a controller connected to the cache memory and the data transmission interface. The controller obtains a data access pattern of a client terminal with respect to a storage unit and performs caching operations on the storage unit according to a caching criterion to obtain and store cache data in the cache memory and send the cache data to an analyst server via the data transmission interface.
Yet another embodiment of the present invention provides a data processing method comprising: (a) obtaining a data access pattern of a client terminal with respect to a data storage unit, (b) performing caching operations on the data storage unit according to a caching criterion to thereby obtain and store cache data in a cache memory, and (c) sending the cache data to an analyst server via a data transmission interface so that the analyst server analyzes the cache data and thereby generates an analysis result.
DETAILED DESCRIPTION
Embodiments of the present invention select useful information from big data in a short period of time and provide methods and tools for analyzing the information thus selected. For example, traffic on highways can be instantly smoothed by quickly identifying a key section of a road rather than the road in its entirety, analyzing its traffic flow data, and allocating lanes accordingly.
Instead of analyzing all the data in a storage device directly, the present invention discloses enabling a caching device to monitor a data access pattern of a client terminal with respect to the storage device in real time, cache appropriate or crucial data replicas from the storage device according to caching criteria to meet a wide variety of objectives and needs of data analysis, and send out the data replicas to serve as samples for data analysis.
For example, if hot data is regarded as a caching criterion, then the caching device will retrieve and send the hot data to the analyst server for analysis. The hot data, for example, includes video, personal or corporate data or stock-related data, which is intensively accessed within a fixed period of time for analysis by the analyst server. Afterward, characteristics of hot data are used in making operation policy, for example, placing popular video data at a server near the client terminal to enhance performance and service quality.
According to an embodiment of the present invention, a data analysis system comprises an analyst server, at least one data storage unit, a client terminal independent of the analyst server, and a caching device independent of the analyst server. The caching device further comprises a cache memory, a data transmission interface, and a controller connected to the analyst server, the client terminal, and the storage unit. The controller obtains a data access pattern of the client terminal with respect to the at least one data storage unit, performs caching operations on the at least one data storage unit according to a caching criterion to obtain and store cache data in the cache memory, and sends the cache data to the analyst server via the data transmission interface, such that the analyst server analyzes the cache data to generate an analysis result.
In another embodiment, the present invention further provides a caching device for use in the data analysis system and a data processing method for use with the caching device.
Reference throughout this specification to features, advantages, or similar language does not imply that all of the features and advantages that may be realized with the present invention should be or are in any single embodiment of the invention. Rather, language referring to the features and advantages is understood to mean that a specific feature, advantage, or characteristic described in connection with an embodiment is included in at least one embodiment of the present invention. Thus, discussion of the features and advantages, and similar language, throughout this specification may, but do not necessarily, refer to the same embodiment.
Furthermore, the described features, advantages, and characteristics of the invention may be combined in any suitable manner in one or more embodiments. One skilled in the relevant art will recognize that the invention may be practiced without one or more of the specific features or advantages of a particular embodiment. In other instances, additional features and advantages may be recognized in certain embodiments that may not be present in all embodiments of the invention.
Referring now to FIG. 1 through FIG. 3, computer systems, methods, and computer program products are illustrated as structural or functional block diagrams or process flowcharts according to various embodiments of the present invention. The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
<Data Analysis System>
FIG. 1 is a block diagram of a data analysis system 10 according to an embodiment of the present invention. The data analysis system 10 comprises an analyst server 100, a client terminal 102, a storage unit 104, and a caching device 106. FIG. 1 does not limit the number of analyst servers, storage units, client terminals, or caching devices in the data analysis system of the present invention.
The analyst server 100 is a server, for example, IBM's System X, Blade Center or eServer server, which has programs for executing data analytic applications, such as Microsoft's SQL Server products.
The client terminal 102 is independent of the analyst server 100 and is exemplified by a personal computer, a mobile device, or another server, which does not limit the present invention.
The storage unit 104 may, for example, be in the form of a network-attached storage (NAS), a storage area network (SAN), or a direct attached storage (DAS) to enable the client terminal 102 to perform data access. However, the storage unit 104 can be directly connected to the client terminal 102 to function as a local device for use with the client terminal 102, and the present invention is not limited thereto.
The caching device 106 is also independent of the analyst server 100. Related details are described below in conjunction with FIG. 2.
The analyst server 100, the client terminal 102, the storage unit 104, and the caching device 106 are linked, as needed, by a local bus, a local area network, the Internet, or any other data transmission channel to perform data communication. In a preferred embodiment, the caching device 106 is directly linked to the storage unit 104 via a local bus (not shown). To enhance stability and security, the analyst server 100 is independent of the client terminal 102, the storage unit 104, and the caching device 106.
<Caching Device>
FIG. 2 is a block diagram of the caching device 106 in accordance with one embodiment. The caching device 106 further comprises a cache memory 200, a controller 202, and a data transmission interface 204. Preferably, the cache memory 200 is a solid-state memory (for example, a flash memory) which reads and writes data faster than the storage unit 104 does, though the present invention is not limited thereto. The cache memory 200 may alternatively be in the form of a hard disk drive or any other storage device. The cache memory 200 and the controller 202 are linked, as needed, by a local bus, a local area network, the Internet, or any other data transmission channel to perform data communication.
The controller 202 is able to perform conventional caching operations and stores cache data (that is, replicas of specific data in the storage unit 104) in the cache memory 200. Hence, the client terminal 102 (as shown in FIG. 1) reads and writes data directly from the cache memory 200 rather than from the slower storage unit 104. The improvements of the controller 202 over its conventional counterparts are described below in conjunction with the flow chart of FIG. 3.
<Caching Criteria>
Step 300: the controller 202 monitors how the client terminal 102 performs data access to the storage unit 104 within a given period and calculates a data access pattern, e.g., access frequency. In this embodiment, the data access pattern is provided as a log of data access performed by the client terminal 102 to the storage unit 104 within a given period; portions of the data access pattern which are not related to the present invention are omitted from this description.
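As an illustration of step 300 only, the following Python sketch derives an access frequency from a log of accesses. The record layout (`AccessRecord` and its fields) and the function name `access_frequency` are assumptions made for this example; the specification does not define a log format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRecord:
    """One log entry; every field name here is hypothetical."""
    timestamp: float  # seconds since monitoring started
    address: int      # data address in the storage unit
    size: int         # bytes transferred
    file_name: str    # file the access belongs to


def access_frequency(log, start, end):
    """Count accesses per address within the period [start, end)."""
    return Counter(r.address for r in log if start <= r.timestamp < end)


log = [
    AccessRecord(0.5, 0x10, 4096, "a.mp4"),
    AccessRecord(1.0, 0x10, 4096, "a.mp4"),
    AccessRecord(1.5, 0x20, 512, "b.csv"),
    AccessRecord(9.0, 0x10, 4096, "a.mp4"),  # outside the period below
]
pattern = access_frequency(log, start=0.0, end=5.0)
```

Here `pattern` maps each address to the number of accesses observed in the given period; other parameters of the log, such as size or file name, could be aggregated in the same way.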
Step 302: in this step, the controller 202 performs caching operations on the storage unit 104 according to a caching criterion so as to obtain cache data (that is, replicas of specific data in the storage unit 104) and store the cache data in the cache memory 200.
In an embodiment, a caching criterion may relate to a given access frequency, and thus cache data may be defined as data (i.e., hot data) acquired as a result of access by the client terminal 102 to the storage unit 104 within a given period when the access frequency exceeds a given value. Alternatively, cache data may be defined as data (i.e., cold data) acquired at an access frequency below a given value. Likewise, it is also feasible to set the caching criterion to a given range of access frequency.
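A minimal sketch of a frequency-based criterion follows, assuming the data access pattern is available as a mapping from data address to access count. The function name `select_by_frequency` and the sample counts are illustrative and are not part of the specification.

```python
def select_by_frequency(frequency, low=None, high=None):
    """Return addresses whose access count c satisfies low < c < high.

    A bound left as None is treated as unbounded, so:
      low only     -> hot data  (frequency exceeds a given value)
      high only    -> cold data (frequency below a given value)
      low and high -> a given range of access frequency
    """
    selected = []
    for address, count in frequency.items():
        if low is not None and count <= low:
            continue
        if high is not None and count >= high:
            continue
        selected.append(address)
    return sorted(selected)


# Hypothetical access counts per address within a given period.
frequency = {0x10: 120, 0x20: 3, 0x30: 45}

hot = select_by_frequency(frequency, low=100)
cold = select_by_frequency(frequency, high=10)
mid = select_by_frequency(frequency, low=10, high=100)
```

The selected addresses identify the data whose replicas would be stored in the cache memory 200 under the chosen criterion.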
In another embodiment, a caching criterion may relate to a given access sequence. For example, cache data may be defined as data, which consists of the latest 1000 pieces of data or the earliest 500 pieces of data, acquired as a result of access by the client terminal 102 to the storage unit 104. Likewise, it is feasible to set the caching criterion to a given range of access sequence.
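The access-sequence criterion can be sketched as slicing an ordered list of accesses. This is an assumption-laden illustration: the list `accesses` stands in for data items in the order the client terminal accessed them, and the helper names are hypothetical.

```python
def latest(accesses, n):
    """Return the n most recently accessed items, oldest first."""
    return accesses[-n:] if n > 0 else []


def earliest(accesses, n):
    """Return the n earliest accessed items."""
    return accesses[:n]


def sequence_range(accesses, first, last):
    """Return items whose position in the access order is in [first, last)."""
    return accesses[first:last]


# Stand-in for 2000 data items listed in access order.
accesses = list(range(2000))

latest_1000 = latest(accesses, 1000)
earliest_500 = earliest(accesses, 500)
middle = sequence_range(accesses, 500, 1000)
```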
In yet another embodiment, a caching criterion may relate to a given access period. For example, cache data may be defined as data acquired as a result of access by the client terminal 102 to the storage unit 104 before or after a specific point in time. Likewise, it is feasible to set the caching criterion to a given range of access period.
In a further embodiment, a caching criterion may relate to a given data address. For example, cache data may be defined as data acquired as a result of access by the client terminal 102 to the storage unit 104 at a given data address. Likewise, it is feasible to set the caching criterion to a given range of data addresses.
In a still further embodiment, a caching criterion may relate to a given data size. For example, cache data may be defined as data acquired as a result of access by the client terminal 102 to the storage unit 104, wherein the size of the data acquired is larger or smaller than a given data size. Likewise, it is feasible to set the caching criterion to a given range of data size.
In another embodiment, a caching criterion may relate to a given string. For example, cache data may be defined as data acquired as a result of access by the client terminal 102 to the storage unit 104, wherein the data acquired has a given string. Likewise, it is feasible to set the caching criterion to any particular combination of strings.
In an additional embodiment, a caching criterion may relate to a given value of at least a parameter contained in the data access pattern. Hence, in step 300, the caching criterion may be defined as a given value of a parameter available in the data access pattern calculated by the controller 202. For example, if the data access pattern comprises a data-related file name, a given file name can function as the caching criterion.
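The criteria based on access period, data address, data size, string, and file name can each be viewed as a predicate over a single access. The sketch below is one possible way to express and combine them; the `Access` record, the helper names, and the sample data are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Access:
    """One access by the client terminal; field names are hypothetical."""
    timestamp: float
    address: int
    payload: bytes
    file_name: str


Criterion = Callable[[Access], bool]


def accessed_after(t: float) -> Criterion:
    return lambda a: a.timestamp > t


def address_in(lo: int, hi: int) -> Criterion:
    return lambda a: lo <= a.address < hi


def size_larger_than(n: int) -> Criterion:
    return lambda a: len(a.payload) > n


def contains(s: bytes) -> Criterion:
    return lambda a: s in a.payload


def file_named(name: str) -> Criterion:
    return lambda a: a.file_name == name


def all_of(*criteria: Criterion) -> Criterion:
    """Combine several criteria; an access must satisfy every one."""
    return lambda a: all(c(a) for c in criteria)


def apply_criterion(accesses, criterion):
    """Return the accesses whose data would be cached."""
    return [a for a in accesses if criterion(a)]


accesses = [
    Access(10.0, 0x100, b"GET /video/42", "video.log"),
    Access(20.0, 0x200, b"x" * 2048, "blob.bin"),
    Access(30.0, 0x180, b"stock:IBM", "ticks.csv"),
]

large = apply_criterion(accesses, size_larger_than(1024))
with_string = apply_criterion(accesses, contains(b"stock"))
recent_in_range = apply_criterion(
    accesses, all_of(accessed_after(15.0), address_in(0x100, 0x200))
)
```

Treating each criterion as a predicate keeps the selection logic independent of how the criterion is chosen, which suits a design in which the criterion may later be changed.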
Step 302 does not necessarily follow step 300. Step 300 and step 302 can take place simultaneously, provided that cache data in step 302 is acquired after step 300.
Step 304: the controller 202 sends cache data stored in the cache memory 200 to the analyst server 100 via the data transmission interface 204. If the caching device 106 is mounted on a motherboard (not shown), the data transmission interface 204 can be a PCI-e interface or an InfiniBand interface.
Step 306: the analyst server 100 analyzes cache data to generate an analysis result. For example, an analysis result may be generated using SQL Server products of Microsoft Corporation, which are applicable to data mining as described in "Predictive Analysis with SQL Server 2008", a White Paper published by Microsoft Corporation. The present invention does not restrict the way in which cache data is analyzed.
Step 308: optionally, the analyst server 100 sends an instruction to the controller 202 to change the caching criterion, and then the process flow of the method goes back to step 300, or goes back to step 302 if the data access pattern need not be updated. Afterward, the process flow of the method proceeds to steps 304-306.
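The loop formed by steps 300 through 308 can be sketched in Python as follows. This is a toy simulation under stated assumptions, not the disclosed implementation: the `CachingDevice` class, the threshold-style criterion, and the `analyze` rule that asks for a stricter criterion when too many items are cached are all hypothetical.

```python
from collections import Counter


class CachingDevice:
    """Toy stand-in for the caching device 106 (names are hypothetical)."""

    def __init__(self, threshold):
        self.threshold = threshold  # caching criterion: minimum access count
        self.cache = {}

    def observe(self, log):
        """Step 300: derive the data access pattern from an access log."""
        return Counter(log)

    def fill_cache(self, pattern, storage):
        """Step 302: cache replicas of data that meet the criterion."""
        self.cache = {
            key: storage[key]
            for key, count in pattern.items()
            if count > self.threshold
        }

    def send(self):
        """Step 304: hand the cache data to the analyst server."""
        return dict(self.cache)


def analyze(cache_data, max_items):
    """Steps 306 and 308: return a criterion change request, or None."""
    if len(cache_data) > max_items:
        return {"raise_threshold_by": 1}
    return None


storage = {"a": b"A", "b": b"B", "c": b"C"}
log = ["a", "a", "a", "b", "b", "c"]

device = CachingDevice(threshold=0)
pattern = device.observe(log)  # step 300, performed once
rounds = 0
while True:
    device.fill_cache(pattern, storage)            # step 302
    request = analyze(device.send(), max_items=1)  # steps 304-306
    rounds += 1
    if request is None:
        break
    # Step 308: the pattern is unchanged, so the flow returns to step 302.
    device.threshold += request["raise_threshold_by"]
```

In this run the analyst side requests a stricter criterion twice, and the loop ends once only the most frequently accessed item remains in the cache.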
SRC=https://www.google.com.hk/patents/US20140068180