Source: https://github.com/chuanconggao/PrefixSpan-py

API Usage

Alternatively, you can use the algorithms via API.

from prefixspan import PrefixSpan

db = [
    [0, 1, 2, 3, 4],
    [1, 1, 1, 3, 4],
    [2, 1, 2, 2, 0],
    [1, 1, 1, 2, 2],
]

ps = PrefixSpan(db)

For details of each parameter, please refer to the PrefixSpan class in prefixspan/api.py.

Setting length limits:

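ps = PrefixSpan(db)
ps.minlen = 3
ps.maxlen = 5

# Note: the outputs listed in the comments below match the default length settings.
# With minlen = 3 and maxlen = 5, only patterns of length 3 to 5 would be reported.
print(ps.frequent(2))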
# [(2, [0]),
# (4, [1]),
# (3, [1, 2]),
# (2, [1, 2, 2]),
# (2, [1, 3]),
# (2, [1, 3, 4]),
# (2, [1, 4]),
# (2, [1, 1]),
# (2, [1, 1, 1]),
# (3, [2]),
# (2, [2, 2]),
# (2, [3]),
# (2, [3, 4]),
# (2, [4])]

print(ps.topk(5))
# [(4, [1]),
# (3, [2]),
# (3, [1, 2]),
# (2, [1, 3]),
# (2, [1, 3, 4])]

print(ps.frequent(2, closed=True))
print(ps.topk(5, closed=True))
print(ps.frequent(2, generator=True))
print(ps.topk(5, generator=True))
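The effect of the length limits on a plain frequent() result can be sketched without the library, by filtering the pattern list printed above. This is only an illustration: filter_by_length is a hypothetical helper, not part of prefixspan, and it assumes that minlen and maxlen simply restrict which patterns are reported.

```python
# (support, pattern) pairs copied from the ps.frequent(2) output above.
frequent = [
    (2, [0]), (4, [1]), (3, [1, 2]), (2, [1, 2, 2]), (2, [1, 3]),
    (2, [1, 3, 4]), (2, [1, 4]), (2, [1, 1]), (2, [1, 1, 1]),
    (3, [2]), (2, [2, 2]), (2, [3]), (2, [3, 4]), (2, [4]),
]


def filter_by_length(results, minlen=1, maxlen=1000):
    """Hypothetical helper: keep patterns whose length lies in [minlen, maxlen].

    The default bounds are assumptions chosen for this sketch.
    """
    return [(sup, patt) for sup, patt in results if minlen <= len(patt) <= maxlen]


# Only the three patterns of length 3 remain.
print(filter_by_length(frequent, minlen=3, maxlen=5))
# [(2, [1, 2, 2]), (2, [1, 3, 4]), (2, [1, 1, 1])]
```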

Closed Patterns and Generator Patterns

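A frequent sequential pattern is a pattern that appears in at least "minsup" sequences of a sequence database, where minsup (the minimum support) is a parameter set by the user.

A frequent closed sequential pattern is a frequent sequential pattern that is not included in another sequential pattern having exactly the same support.

Algorithms such as PrefixSpan find frequent sequential patterns. Algorithms such as BIDE+ find frequent closed sequential patterns. BIDE+ is usually much faster than PrefixSpan because it uses pruning techniques to avoid generating all sequential patterns. In addition, the set of closed patterns is usually much smaller than the set of all sequential patterns, so BIDE+ is also more memory-efficient.

Another important point is that closed sequential patterns are a compact and lossless representation of all sequential patterns. The set of closed sequential patterns is usually much smaller, yet the whole set of sequential patterns can be recovered from it without any loss of information, which is very convenient.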

Here is a simple example.

Consider four sequences:

a b c d e
a b d
b e a
b c d e

Let minsup = 2.

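b c is a frequent sequential pattern because it appears in two sequences (its support is 2). b c is not a closed sequential pattern because it is contained in a larger sequential pattern, b c d, which has the same support.

b c d also has a support of 2. It is not a closed sequential pattern either, because it is contained in a larger sequential pattern, b c d e, which has the same support. b c d e is a closed sequential pattern because it is not contained in any other sequential pattern with the same support.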
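The example above can be checked with a few lines of Python. This is a minimal brute-force sketch, not how PrefixSpan or BIDE+ work internally; the helper names (is_subsequence, support, is_closed) are chosen here for illustration. It relies on the fact that if any super-pattern has the same support, then so does some super-pattern with exactly one extra item, so testing single-item insertions is enough.

```python
def is_subsequence(pattern, sequence):
    """True if the items of pattern occur in sequence in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def support(pattern, db):
    """Number of sequences in db that contain pattern."""
    return sum(is_subsequence(pattern, seq) for seq in db)


def is_closed(pattern, db):
    """True if no one-item extension of pattern has the same support."""
    items = {item for seq in db for item in seq}
    sup = support(pattern, db)
    for pos in range(len(pattern) + 1):
        for item in items:
            extended = pattern[:pos] + [item] + pattern[pos:]
            if support(extended, db) == sup:
                return False
    return True


db = [s.split() for s in ["a b c d e", "a b d", "b e a", "b c d e"]]

for text in ["b c", "b c d", "b c d e"]:
    patt = text.split()
    print(text, "| support:", support(patt, db), "| closed:", is_closed(patt, db))
# b c | support: 2 | closed: False
# b c d | support: 2 | closed: False
# b c d e | support: 2 | closed: True
```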

The closed patterns are much more compact due to the smaller number.

  • A pattern is closed if there is no super-pattern with the same frequency.

prefixspan-cli frequent 2 --closed test.dat

0 : 2
1 : 4
1 2 : 3
1 2 2 : 2
1 3 4 : 2
1 1 1 : 2

The generator patterns are even more compact due to both the smaller number and the shorter lengths.

  • A pattern is generator if there is no sub-pattern with the same frequency.

  • Due to the high compactness, generator patterns are useful as features for classification, etc.

prefixspan-cli frequent 2 --generator test.dat

0 : 2
1 1 : 2
2 : 3
2 2 : 2
3 : 2
4 : 2

There are patterns that are both closed and generator.

prefixspan-cli frequent 2 --closed --generator test.dat

0 : 2
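The three outputs above can be reproduced by brute force directly from the two definitions. The sketch below is not how prefixspan computes its results; it enumerates every subsequence of the four example sequences, which is only feasible for tiny inputs. It assumes that the empty pattern counts as a sub-pattern in the generator check, which is the reading that matches the output shown above (for example, 1 has the same frequency as the empty pattern, so it is not a generator).

```python
from itertools import combinations

db = [
    [0, 1, 2, 3, 4],
    [1, 1, 1, 3, 4],
    [2, 1, 2, 2, 0],
    [1, 1, 1, 2, 2],
]
minsup = 2


def is_subsequence(pattern, sequence):
    """True if the items of pattern occur in sequence in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def support(pattern):
    """Number of sequences in db that contain pattern."""
    return sum(is_subsequence(pattern, seq) for seq in db)


# Candidates: every subsequence of every sequence, including the empty pattern.
candidates = {
    tuple(seq[i] for i in idx)
    for seq in db
    for n in range(len(seq) + 1)
    for idx in combinations(range(len(seq)), n)
}
supports = {p: support(p) for p in candidates}
frequent = {p: s for p, s in supports.items() if s >= minsup}


def is_proper_subpattern(p, q):
    """True if p is contained in q and strictly shorter."""
    return len(p) < len(q) and is_subsequence(p, q)


# Closed: no super-pattern with the same frequency.
closed = sorted(
    p for p in frequent
    if p and not any(is_proper_subpattern(p, q) and frequent[q] == frequent[p]
                     for q in frequent)
)
# Generator: no sub-pattern (the empty pattern included) with the same frequency.
generator = sorted(
    p for p in frequent
    if p and not any(is_proper_subpattern(q, p) and frequent[q] == frequent[p]
                     for q in frequent)
)
both = [p for p in closed if p in generator]

print(closed)     # [(0,), (1,), (1, 1, 1), (1, 2), (1, 2, 2), (1, 3, 4)]
print(generator)  # [(0,), (1, 1), (2,), (2, 2), (3,), (4,)]
print(both)       # [(0,)]
```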

Note: there are many algorithms for pattern mining.

SPMF offers implementations of the following data mining algorithms.

Sequential Pattern Mining

These algorithms discover sequential patterns in a set of sequences. For a good overview of sequential pattern mining algorithms, please read this survey paper.

Sequential Rule Mining

These algorithms discover sequential rules in a set of sequences.

Sequence Prediction

These algorithms predict the next symbol(s) of a sequence based on a set of training sequences

Itemset Mining

These algorithms discover interesting itemsets (sets of values) that appear in a transaction database (database records containing symbolic data). For a good overview of itemset mining, please read this survey paper.

  • algorithms for discovering frequent itemsets in a transaction database.

  • algorithms for discovering frequent closed itemsets in a transaction database.
  • algorithms for recovering all frequent itemsets from frequent closed itemsets:
    • the LevelWise algorithm (Pasquier et al., 1999) 
    • the DFI-Growth algorithm (___ et al., 2018) 
  • algorithms for discovering frequent maximal itemsets in a transaction database.
    • the FPMax algorithm (Grahne and Zhu, 2003)
    • the Charm-MFI algorithm for discovering frequent closed itemsets and maximal frequent itemsets by post-processing in a transaction database (Szathmary et al. 2006)
  • algorithms for mining frequent itemsets with multiple minimum supports
  • algorithms for mining generator itemsets in a transaction database
    • the DefMe algorithm for mining frequent generator itemsets in a transaction database (Soulet & Rioult, 2014)
    • the Pascal algorithm for mining frequent itemsets, and identifying at the same time which ones are generators (Bastide et al., 2002)
    • the Zart algorithm for discovering frequent closed itemsets and their generators in a transaction database (Szathmary et al. 2007)
  • algorithms for mining rare itemsets and/or correlated itemsets in a transaction database
    • the AprioriInverse algorithm for mining perfectly rare itemsets (Koh & Rountree, 2005)
    • the AprioriRare algorithm for mining minimal rare itemsets and frequent itemsets (Szathmary et al. 2007b)
    • the CORI algorithm for mining minimal rare correlated itemsets using the support and bond measures (Bouasker et al. 2015)
    • the RP-Growth algorithm for mining rare itemsets (Tsang et al., 2011) 
  • algorithms for performing targeted and dynamic queries about association rules and frequent itemsets.
    • the Itemset-Tree, a data structure that can be updated incrementally, and algorithms for querying it. (Kubat et al, 2003)
    • the Memory-Efficient Itemset-Tree, a data structure that can be updated incrementally, and algorithms for querying it. (Fournier-Viger, 2013)
  • algorithms to discover frequent itemsets in a stream
    • the estDec algorithm for mining recent frequent itemsets in a data stream (Chang & Lee, 2003)
    • the estDec+ algorithm for mining recent frequent itemsets in a data stream (Shin et al., 2014)
    • the CloStream algorithm for mining frequent closed itemsets in a data stream (Yen et al, 2009)
  • the U-Apriori algorithm for mining frequent itemsets in uncertain data (Chui et al, 2007)
  • the VME algorithm for mining erasable itemsets (Deng & Xu, 2010)
  • algorithms to discover fuzzy frequent itemsets in a quantitative transaction database

Periodic Pattern Mining

These algorithms discover patterns that periodically appear in a sequence of complex events (also called a transaction database).

  • the PFPM algorithm (Fournier-Viger et al, 2016a) for mining frequent periodic patterns in a sequence of transactions (a transaction database)
  • the PHM algorithm (Fournier-Viger et al, 2016b) for mining periodic high-utility patterns (periodic patterns that yield a high profit) in a sequence of transactions (a transaction database) containing utility information

Episode Mining

These algorithms discover episodes that appear in a single sequence of complex events.

  • the TUP algorithm (Rathore et al., 2016) for mining the top-k high utility episodes in a sequence of complex events (a transaction database) with utility information
  • the UP-SPAN algorithm (Wu et al., 2013) for mining high utility episodes in a sequence of complex events (a transaction database) with utility information

High-Utility Pattern Mining

These algorithms discover patterns having a high utility (importance) in different kinds of data. For a good overview of high-utility itemset mining, you may read this survey paper and the high-utility pattern mining book.

  • algorithms for mining high-utility itemsets in a transaction database having profit information

  • algorithm for efficiently mining high-utility itemsets with length constraints in a transaction database
  • algorithm for mining correlated high-utility itemsets in a transaction database
  • algorithm for mining high-utility itemsets in a transaction database containing negative unit profit values
  • algorithm for mining frequent high-utility itemsets in a transaction database
  • algorithm for mining on-shelf high-utility itemsets in a transaction database containing information about time periods of items
  • algorithm for incremental high-utility itemset mining in a transaction database
  • algorithm for mining concise representations of high-utility  itemsets in a transaction database
  • algorithm for mining the skyline high-utility itemsets in a transaction database
  • algorithm for mining the top-k high-utility itemsets in a transaction database
  • algorithms for mining the top-k high utility itemsets from a data stream with a window
  • algorithm for mining frequent skyline utility patterns in a transaction database
  • algorithm for mining quantitative high utility itemsets in a transaction database:
  • algorithm for mining high-utility sequential rules in a sequence database 
  • algorithm for mining high-utility sequential patterns in a sequence database 
    • the USPAN algorithm (Yin et al. 2012)
  • algorithm for mining high-utility probability sequential patterns in a sequence database 
  • algorithm for mining high-utility itemsets in a transaction database using evolutionary algorithms
  • algorithm for mining high average-utility itemsets in a transaction database
    • the HAUI-Miner algorithm for mining high average-utility itemsets (Lin et al, 2016)
    • the EHAUPM algorithm for mining high average-utility itemsets (Lin et al, 2017)
    • the HAUI-MMAU algorithm for mining high average-utility itemsets with multiple thresholds (Lin et al, 2016)
    • the MEMU algorithm for mining high average-utility itemsets with multiple thresholds (Lin et al, 2018)
  • algorithms for mining high utility episodes in a sequence of complex events (a transaction database)
    • the TUP algorithm (Rathore et al., 2016) for mining the top-k high utility episodes in a sequence of complex events (a transaction database) with utility information
    • the UP-SPAN algorithm (Wu et al., 2013) for mining high utility episodes in a sequence of complex events (a transaction database) with utility information
  • algorithms for mining periodic high-utility patterns (periodic patterns that yield a high profit) in a sequence of transactions (a transaction database) containing utility information
  • algorithms for discovering irregular high utility itemsets (non periodic patterns) in a transaction database with utility information
    • the PHM_irregular algorithm, which is a simple variation of the PHM algorithm 
  • algorithm for discovering local high utility itemsets in a database with utility information and timestamps
  • algorithm for discovering peak high utility itemsets in a database with utility information and timestamps

Association Rule Mining

These algorithms discover interesting associations between symbols (values) in a transaction database (database records with binary attributes).

  • an algorithm for mining all association rules in a transaction database (Agrawal & Srikant, 1994)
  • an algorithm for mining all association rules with the lift measure in a transaction database (adapted from Agrawal & Srikant, 1994)
  • an algorithm for mining the IGB informative and generic basis of association rules in a transaction database (Gasmi et al., 2005)
  • an algorithm for mining perfectly sporadic association rules (Koh & Rountree, 2005)
  • an algorithm for mining closed association rules (Szathmary et al. 2006).
  • an algorithm for mining minimal non redundant association rules (Kryszkiewicz, 1998)
  • the Indirect algorithm for mining indirect association rules (Tan et al. 2000; Tan et al. 2006)
  • the FHSAR algorithm for hiding sensitive association rules (Weng et al. 2008)
  • the TopKRules algorithm for mining the top-k association rules (Fournier-Viger, 2012b)
  • the TopKClassRules algorithm for mining the top-k class association rules (a variation of TopKRules, which is described in Fournier-Viger, 2012b)
  • the TNR algorithm for mining top-k non-redundant association rules (Fournier-Viger 2012d)

Stream pattern mining

These algorithms discover various kinds of patterns in a stream (an infinite sequence of database records, or transactions).

  • the estDec algorithm for mining recent frequent itemsets in a data stream (Chang & Lee, 2003)
  • the estDec+ algorithm for mining recent frequent itemsets in a data stream (Shin et al., 2014)
  • the CloStream algorithm for mining frequent closed itemsets in a data stream (Yen et al, 2009)
  • algorithms for mining the top-k high utility itemsets from a data stream with a window

Clustering

These algorithms automatically find clusters in different kinds of data

  • the original K-Means algorithm (MacQueen, 1967)
  • the Bisecting K-Means algorithm (Steinbach et al, 2000)
  • algorithms for density-based clustering
    • the DBScan algorithm (Ester et al., 1996)
    • the Optics algorithm to extract a cluster ordering of points, which can then be used to generate DBScan-style clusters and more (Ankerst et al, 1999)
  • hierarchical clustering algorithm
  • a tool called Cluster Viewer for visualizing clusters
  • a tool called Instance Viewer for visualizing the input of clustering algorithms

Time series mining

These algorithms perform various tasks to analyze time series data

    • an algorithm for converting a time series to a sequence of symbols using the SAX representation of time series (SAX, 2007). Note that converting a set of time series with SAX yields a sequence database, to which traditional algorithms for sequential rule mining and sequential pattern mining can then be applied.
    • algorithms for calculating the prior moving average of a time series (to remove noise)
    • algorithms for calculating the cumulative moving average of a time series (to remove noise)
    • algorithms for calculating the central moving average of a time series (to remove noise)
    • an algorithm for calculating the median smoothing of a time series (to remove noise)
    • an algorithm for calculating the exponential smoothing of a time series (to remove noise) 
    • an algorithm for calculating the min max normalization of a time series 
    • an algorithm for calculating the autocorrelation function of a time series 
    • an algorithm for calculating the standardization of a time series 
    • an algorithm for calculating the first and second order differencing of a time series
    • an algorithm for calculating the piecewise aggregate approximation of a time series (to reduce the number of data points of a time series)
    • an algorithm for calculating the linear regression of a time series (using the least squares method) 
    • an algorithm for splitting a time series into segments of a given length
    • an algorithm for splitting a time series into a given number of segments
    • algorithms to cluster time series (group time-series according to their similarities). This can be done by applying the clustering algorithms offered in SPMF (K-Means, Bisecting K-Means, DBScan, OPTICS, Hierarchical clustering) on time series.
    • a tool called Time Series Viewer for visualizing time series 
 
