A data analysis system, particularly, a system capable of efficiently analyzing big data is provided. The data analysis system includes an analyst server, at least one data storage unit, a client terminal independent of the analyst server, and a caching device independent of the analyst server. The caching device includes a caching memory, a data transmission interface, and a controller for obtaining a data access pattern of the client terminal with respect to the at least one data storage unit, performing caching operations on the at least one data storage unit according to a caching criterion to obtain and store cache data in the caching memory, and sending the cache data to the analyst server via the data transmission interface, such that the analyst server analyzes the cache data to generate an analysis result, which may be used to request a change in the caching criterion.

BACKGROUND

1. Field of the Invention

The present invention relates to data analysis systems, and more particularly, to a system for analyzing big data according to caching criteria of a caching device.

2. Background of the Related Art

With information devices being in wide use, data sources nowadays are becoming more abundant. In addition to conventional manual input and system computation, data is generated at every moment as a result of the Internet, the emergence of cloud computing, the rapid development of mobile computing and the Internet of Things (IOT), and the ubiquitous mobile apparatuses, RFID, and wireless sensors.

Big data cannot work by itself. A large storage unit is required to provide sufficient data storage space. A caching device, especially a solid-state storage device, typically stores data replicas in the large storage unit (for example, a hard disk drive) to speed up data access of the system.

BRIEF SUMMARY

One embodiment of the present invention provides a data analysis system comprising an analyst server, at least one data storage unit, a client terminal independent of the analyst server, and a caching device independent of the analyst server. The caching device comprises a cache memory, a data transmission interface, and a controller in communication with the analyst server, the client terminal, and the storage unit. The controller obtains a data access pattern of the client terminal with respect to the storage unit and performs caching operations on the storage unit according to a caching criterion to obtain and store cache data in the cache memory and send the cache data to the analyst server via the data transmission interface, thereby allowing the analyst server to analyze the cache data and generate an analysis result.

Another embodiment of the present invention provides a caching device comprising a cache memory, a data transmission interface, and a controller connected to the cache memory and the data transmission interface. The controller obtains a data access pattern of a client terminal with respect to a storage unit and performs caching operations on the storage unit according to a caching criterion to obtain and store cache data in the cache memory and send the cache data to an analyst server via the data transmission interface.

Yet another embodiment of the present invention provides a data processing method comprising: (a) obtaining a data access pattern of a client terminal with respect to a data storage unit, (b) performing caching operations on the data storage unit according to a caching criterion to thereby obtain and store cache data in the cache memory, and (c) sending the cache data to an analyst server via the data transmission interface so as for the analyst server to analyze the cache data and thereby generate an analysis result.

DETAILED DESCRIPTION

Embodiments of the present invention select useful information from big data in a short period of time with methods and tools to analyze the useful information thus selected. For example, traffic on highways can be instantly smoothened by quickly identifying a key section of a road rather than the road in its entirety, analyzing its traffic flow data, and allocating lanes accordingly.

Instead of analyzing all the data in a storage device directly, the present invention discloses enabling a caching device to monitor a data access pattern of a client terminal with respect to the storage device in real time, cache appropriate or crucial data replicas from the storage device according to caching criteria to meet a wide variety of objectives and needs of data analysis, and send out the data replicas to serve as samples for data analysis.

For example, if hot data is regarded as a caching criterion, then the caching device will retrieve and send the hot data to the analyst server for analysis. The hot data, for example, includes video, personal or corporate data or stock-related data, which is intensively accessed within a fixed period of time for analysis by the analyst server. Afterward, characteristics of hot data are used in making operation policy, for example, placing popular video data at a server near the client terminal to enhance performance and service quality.

According to an embodiment of the present invention, a data analysis system comprises an analyst server, at least one data storage unit, a client terminal independent of the analyst server, and a caching device independent of the analyst server. The caching device further comprises a cache memory, a data transmission interface, and a controller connected to the analyst server, the client terminal, and the storage unit. The controller obtains a data access pattern of the client terminal with respect to the at least one data storage unit, performs caching operations on the at least one data storage unit according to a caching criterion to obtain and store cache data in the caching memory, and sends the cache data to the analyst server via the data transmission interface, such that the analyst server analyzes the cache data to generate an analysis result.

In another embodiment, the present invention further provides a caching device for use in the data analysis system and a data processing method for use with the caching device.

Reference throughout this specification to features, advantages, or similar language does not imply that all of the features and advantages that may be realized with the present invention should be or are in any single embodiment of the invention. Rather, language referring to the features and advantages is understood to mean that a specific feature, advantage, or characteristic described in connection with an embodiment is included in at least one embodiment of the present invention. Thus, discussion of the features and advantages, and similar language, throughout this specification may, but do not necessarily, refer to the same embodiment.

Furthermore, the described features, advantages, and characteristics of the invention may be combined in any suitable manner in one or more embodiments. One skilled in the relevant art will recognize that the invention may be practiced without one or more of the specific features or advantages of a particular embodiment. In other instances, additional features and advantages may be recognized in certain embodiments that may not be present in all embodiments of the invention.

Referring now to FIG. 1 through FIG. 3, computer systems, methods, and computer program products are illustrated as structural or functional block diagrams or process flowcharts according to various embodiments of the present invention. The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

<Data Analysis System>

FIG. 1 is a block diagram of a data analysis system 10 according to an embodiment of the present invention. The data analysis system 10 comprises an analyst server 100, a client terminal 102, a storage unit 104, and a caching device 106.FIG. 1 is not restrictive of the quantity of an analyst server, a storage unit, a client terminal, and a caching device of the data analysis system of the present invention.

The analyst server 100 is a server, for example, IBM's System X, Blade Center or eServer server, which has programs for executing data analytic applications, such as Microsoft's SQL Server products.

The client terminal 102 is independent of the analyst server 100 and is exemplified by a personal computer, a mobile device, or another server, which does not limit the present invention.

The storage unit 104 may, for example, be in the form of a network-attached storage (NAS), a storage area network (SAN), or a direct attached storage (DAS) to enable the client terminal 102 to perform data access. However, the storage unit 104 can be directly connected to the client terminal 102 to function as a local device for use with the client terminal 102, and the present invention is not limited thereto.

The caching device 106 is also independent of the analyst server 100. Related details are described below in conjunction with FIG. 2.

The analyst server 100, the client terminal 102, the storage unit 104, and the caching device 106 are linked, as needed, by a local bus, a local area network, the Internet, or any other data transmission channel to perform data communication. In a preferred embodiment, the caching device 106 is directly linked to the storage unit 104 via a local bus (not shown). To enhance stability and security, the analyst server 100 is independent of the client terminal 102, the storage unit 104, and the caching device 106.

<Caching Device>

FIG. 2 is a block diagram of the caching device 106 in accordance with one embodiment. The caching device 106 further comprises a cache memory 200, a controller 202, and a data transmission interface 204. Preferably, the cache memory200 is a solid-state memory (for example, a flash memory) which reads and writes data faster than the storage unit 104does, though the present invention is not limited thereto. The cache memory 200 may, for example, be in the form of a hard disk drive or any other storage device. The cache memory 200 and the controller 202 are linked, as needed, by a local bus, a local area network, the Internet, or any other data transmission channel to perform data communication.

The controller 202 is able to perform conventional caching operations and stores cache data (that is, replicas of specific data in the storage unit 104) in the cache memory 200. Hence, the client terminal 102 (as shown in FIG. 1) reads and writes data from the cache memory 200 directly, rather than reads and writes data from the storage unit 104 slowly. The improvements of the controller 202 and its conventional counterparts are described below in conjunction with the flow chart of FIG. 3.

<Caching Criteria>

Step 300: the controller 202 monitors how the client terminal 102 performs data access to the storage unit 104 within a given period and calculates a data access pattern, e.g., access frequency. In this embodiment, the data access pattern is provided as a log of data access performed by the client terminal 102 to the storage unit 104 within a given period, and thus those portions of the data access pattern which are not related to the present invention are omitted.

Step 302: in this step, the controller 202 performs caching operations on the storage unit 104 according to a caching criterion so as to obtain cache data (that is, replicas of specific data in the storage unit 104) and store the cache data in the cache memory 200.

In an embodiment, a caching criterion may relate to a given access frequency, and thus cache data may be defined as data (i.e., hot data) acquired as a result of access by the client terminal 102 to the storage unit 104 within a given period when the access frequency exceeds a given value. Alternatively, cache data may be defined as data (i.e., cold data) acquired at an access frequency below a given value. Likewise, it is also feasible to set the caching criterion to a given range of access frequency.

In another embodiment, a caching criterion may relate to a given access sequence. For example, cache data may be defined as data, which consists of the latest 1000 pieces of data or the earliest 500 pieces of data, acquired as a result of access by the client terminal 102 to the storage unit 104. Likewise, it is feasible to set the caching criterion to a given range of access sequence.

In yet another embodiment, a caching criterion may relate to a given access period. For example, cache data may be defined as data acquired as a result of access by the client terminal 102 to the storage unit 104 before or after a specific point in time. Likewise, it is feasible to set the caching criterion to a given range of access period.

In a further embodiment, a caching criterion may relate to a given data address. For example, cache data may be defined as data acquired as a result of access by the client terminal 102 to the storage unit 104 at a given data address. Likewise, it is feasible to set the caching criterion to a given range of data addresses.

In a still further embodiment, a caching criterion may relate to a given data size. For example, cache data may be defined as data acquired as a result of access by the client terminal 102 to the storage unit 104, wherein the size of the data acquired is larger or smaller than a given data size. Likewise, it is feasible to set the caching criterion to a given range of data size.

In another embodiment, a caching criterion may relates to a given string. For example, cache data may be defined as data acquired as a result of access by the client terminal 102 to the storage unit 104, wherein the data acquired has a given string. Likewise, it is feasible to set the caching criterion to any particular combination of strings.

In an additional embodiment, a caching criterion may relate to a given value of at least a parameter contained in the data access pattern. Hence, in step 300, the caching criterion may be defined as a given value of a parameter available in the data access pattern calculated by the controller 202. For example, if the data access pattern comprises a data-related file name, a given file name can function as the caching criterion.

Step 302 does not necessarily follow step 300. Step 300 and step 302 can take place simultaneously, provided that cache data in step 302 is acquired after step 300.

Step 304: the controller 202 sends cache data stored in the cache memory 200 to the analyst server 100 via the data transmission interface 204. If the caching device 106 is mounted on a motherboard (not shown), the data transmission interface 204 can be a PCI-e interface or an InfiniBand interface.

Step 306: the analyst server 100 analyzes cache data to generate an analysis result. For example, an analysis result may be generated using SQL Server products of Microsoft Corporation, which are applicable to data mining as described in "Predictive Analysis with SQL Server 2008", a White Paper published by Microsoft Corporation. The present invention is not restrictive of a way of analyzing cache data.

Step 308: selectively, the analyst server 100 sends an instruction to the controller 202 to change the caching criterion, and then the process flow of the method goes back to step 300, or will go back to step 302 if the data access pattern need not be updated. Afterward, the process flow of the method proceeds to steps 304-306.

SRC=https://www.google.com.hk/patents/US20140068180

Data analysis system的更多相关文章

  1. Learning Spark: Lightning-Fast Big Data Analysis 中文翻译

    Learning Spark: Lightning-Fast Big Data Analysis 中文翻译行为纯属个人对于Spark的兴趣,仅供学习. 如果我的翻译行为侵犯您的版权,请您告知,我将停止 ...

  2. An Introduction to Stock Market Data Analysis with R (Part 1)

    Around September of 2016 I wrote two articles on using Python for accessing, visualizing, and evalua ...

  3. HiBench成长笔记——(7) 阅读《The HiBench Benchmark Suite: Characterization of the MapReduce-Based Data Analysis》

    <The HiBench Benchmark Suite: Characterization of the MapReduce-Based Data Analysis>内容精选 We th ...

  4. 《利用Python进行数据分析: Python for Data Analysis 》学习随笔

    NoteBook of <Data Analysis with Python> 3.IPython基础 Tab自动补齐 变量名 变量方法 路径 解释 ?解释, ??显示函数源码 ?搜索命名 ...

  5. Python for Data Analysis

    Data Analysis with Python ch02 一些有趣的数据分析结果 Male描述的是美国新生儿男孩纸的名字的最后一个字母的分布 Female描述的是美国新生儿女孩纸的名字的最后一个字 ...

  6. 深入浅出数据分析 Head First Data Analysis Code 数据与代码

    <深入浅出数据分析>英文名为Head First Data Analysis Code, 这本书中提供了学习使用的数据和程序,原书链接由于某些原因不 能打开,这里在提供一个下载的链接.去下 ...

  7. How to use data analysis for machine learning (example, part 1)

    In my last article, I stated that for practitioners (as opposed to theorists), the real prerequisite ...

  8. 数据分析---《Python for Data Analysis》学习笔记【04】

    <Python for Data Analysis>一书由Wes Mckinney所著,中文译名是<利用Python进行数据分析>.这里记录一下学习过程,其中有些方法和书中不同 ...

  9. 数据分析---《Python for Data Analysis》学习笔记【03】

    <Python for Data Analysis>一书由Wes Mckinney所著,中文译名是<利用Python进行数据分析>.这里记录一下学习过程,其中有些方法和书中不同 ...

随机推荐

  1. 为什么通过空指针(NULL)能够正确调用类的部分成员函数

    #include <iostream> using namespace std; class B { public: void foo() { cout << "B ...

  2. 想在子线程里面触发的信号的槽函数在子线程执行,信号槽连接必须使用DirectConnection 方式(即使跨线程,也可以强迫DirectConnection,而不能是AutoConnection)

    Qt多线程的实现 1.继承QThread,重新run 2.继承Object,调用moveToThread方法 两种方法各有利弊:主要参考:http://blog.51cto.com/9291927/1 ...

  3. Codeforces Round #315 (Div. 2) (ABCD题解)

    比赛链接:http://codeforces.com/contest/569 A. Music time limit per test:2 seconds memory limit per test: ...

  4. hibernate---java.lang.UnsupportedOperationException: The user must supply a JDBC connection

        在配置hibernate时.运行代码时一直抛错: Exception in thread "main" java.lang.UnsupportedOperationExce ...

  5. 【note】缩写词

    CoE CANopen EtherCAT应用程序概要文件CANopen™是一个注冊商标的能够自己主动化汽车集团..纽伦堡.德国CiA402CANopen™驱动器配置文件里指定的IEC 61800-7- ...

  6. [Recompose] Add Lifecycle Hooks to a Functional Stateless Component using Recompose

    Learn how to use the 'lifecycle' higher-order component to conveniently use hooks without using a cl ...

  7. 在word中使用notepad++实现代码的语法高亮 分类: C_OHTERS 2013-09-22 10:38 2273人阅读 评论(0) 收藏

    转载自:http://blog.csdn.net/woohello/article/details/7621651 有时写文档时需要将代码粘贴到word中,但直接粘贴到word中的代码虽能保持换行与缩 ...

  8. 一起学Python:TCP简介

    TCP介绍 TCP协议,传输控制协议(英语:Transmission Control Protocol,缩写为 TCP)是一种面向连接的.可靠的.基于字节流的传输层通信协议,由IETF的RFC 793 ...

  9. 【u002】数列排序(seqsort)

    Time Limit: 1 second Memory Limit: 128 MB [问题描述] 给定一个数列{an},这个数列满足ai≠aj(i≠j),现在要求你把这个数列从小到大排序,每次允许你交 ...

  10. Android组件——使用DrawerLayout仿网易新闻v4.4侧滑菜单

    摘要: 转载请注明出处:http://blog.csdn.net/allen315410/article/details/42914501 概述        今天这篇博客将记录一些关于DrawerL ...