fp-growth代码问题(Python)
网上的 python3 fp-growth代码每次在执行时可能会出现找出的频繁项集不一致的情况,这是因为每次执行代码时建的FP树可能不一致。
加了一行代码可以解决这个问题(第59行):先对 frequentItemsInRecord 按 key 的ASSIC码排序,然后再按照 key 的支持度(即value值)降序排列。
之所以这么做是因为 frequentItemsInRecord 中可能会出现支持度一样的项,如果不按ASSIC码先排一次的话,
有可能出现每次执行代码时 orderedFrequentItems (第60行)中相同支持度的项出现的顺序不一致,从而造成每次建的FP树不一致,导致找出的频繁项集不一致。
import pprint def loadDataSet():
dataSet = [['bread', 'milk', 'vegetable', 'fruit', 'eggs'],
['noodle', 'beef', 'pork', 'water', 'socks', 'gloves', 'shoes', 'rice'],
['socks', 'gloves'],
['bread', 'milk', 'shoes', 'socks', 'eggs'],
['socks', 'shoes', 'sweater', 'cap', 'milk', 'vegetable', 'gloves'],
['eggs', 'bread', 'milk', 'fish', 'crab', 'shrimp', 'rice']] return dataSet def transfer2FrozenDataSet(dataSet):
frozenDataSet = {}
for elem in dataSet:
frozenDataSet[frozenset(elem)] = 1 return frozenDataSet class TreeNode:
def __init__(self, nodeName, count, nodeParent):
self.nodeName = nodeName
self.count = count
self.nodeParent = nodeParent
self.nextSimilarItem = None
self.children = {} def increaseC(self, count):
self.count += count def createFPTree(frozenDataSet, minSupport):
# scan dataset at the first time, filter out items which are less than minSupport
headPointTable = {}
for items in frozenDataSet:
for item in items:
headPointTable[item] = headPointTable.get(item, 0) + frozenDataSet[items]
headPointTable = {
k: v
for k, v in headPointTable.items() if v >= minSupport
}
frequentItems = set(headPointTable.keys())
if len(frequentItems) == 0: return None, None for k in headPointTable:
headPointTable[k] = [headPointTable[k], None] fptree = TreeNode("null", 1, None)
# scan dataset at the second time, filter out items for each record
for items, count in frozenDataSet.items():
frequentItemsInRecord = {}
for item in items:
if item in frequentItems:
frequentItemsInRecord[item] = headPointTable[item][0]
if len(frequentItemsInRecord) > 0:
frequentItemsInRecord = sorted(frequentItemsInRecord.items(), key=lambda v: v[0])
orderedFrequentItems = [v[0] for v in sorted(frequentItemsInRecord, key=lambda v: v[1], reverse=True)]
updateFPTree(fptree, orderedFrequentItems, headPointTable, count) return fptree, headPointTable def updateFPTree(fptree, orderedFrequentItems, headPointTable, count):
# handle the first item
if orderedFrequentItems[0] in fptree.children:
fptree.children[orderedFrequentItems[0]].increaseC(count)
else:
fptree.children[orderedFrequentItems[0]] = TreeNode(orderedFrequentItems[0], count, fptree) # update headPointTable
if headPointTable[orderedFrequentItems[0]][1] == None:
headPointTable[orderedFrequentItems[0]][1] = fptree.children[orderedFrequentItems[0]]
else:
updateHeadPointTable(headPointTable[orderedFrequentItems[0]][1], fptree.children[orderedFrequentItems[0]])
# handle other items except the first item
if (len(orderedFrequentItems) > 1):
updateFPTree(fptree.children[orderedFrequentItems[0]], orderedFrequentItems[1::], headPointTable, count) def updateHeadPointTable(headPointBeginNode, targetNode):
while (headPointBeginNode.nextSimilarItem != None):
headPointBeginNode = headPointBeginNode.nextSimilarItem
headPointBeginNode.nextSimilarItem = targetNode def mineFPTree(headPointTable, prefix, frequentPatterns, minSupport):
# for each item in headPointTable, find conditional prefix path, create conditional fptree,
# then iterate until there is only one element in conditional fptree
headPointItems = [v[0] for v in sorted(headPointTable.items(), key=lambda v: v[1][0])]
if (len(headPointItems) == 0): return for headPointItem in headPointItems:
newPrefix = prefix.copy()
newPrefix.add(headPointItem)
support = headPointTable[headPointItem][0]
frequentPatterns[frozenset(newPrefix)] = support prefixPath = getPrefixPath(headPointTable, headPointItem)
if (prefixPath != {}):
conditionalFPtree, conditionalHeadPointTable = createFPTree(prefixPath, minSupport)
if conditionalHeadPointTable != None:
mineFPTree(conditionalHeadPointTable, newPrefix, frequentPatterns, minSupport) def getPrefixPath(headPointTable, headPointItem):
prefixPath = {}
beginNode = headPointTable[headPointItem][1]
prefixs = ascendTree(beginNode)
if ((prefixs != [])):
prefixPath[frozenset(prefixs)] = beginNode.count while (beginNode.nextSimilarItem != None):
beginNode = beginNode.nextSimilarItem
prefixs = ascendTree(beginNode)
if (prefixs != []):
prefixPath[frozenset(prefixs)] = beginNode.count return prefixPath def ascendTree(treeNode):
prefixs = []
while ((treeNode.nodeParent != None) and (treeNode.nodeParent.nodeName != 'null')):
treeNode = treeNode.nodeParent
prefixs.append(treeNode.nodeName) return prefixs def rulesGenerator(frequentPatterns, minConf, rules):
for frequentset in frequentPatterns:
if (len(frequentset) > 1):
getRules(frequentset, frequentset, rules, frequentPatterns, minConf) def removeStr(set, str):
tempSet = []
for elem in set:
if (elem != str):
tempSet.append(elem)
tempFrozenSet = frozenset(tempSet) return tempFrozenSet def getRules(frequentset, currentset, rules, frequentPatterns, minConf):
for frequentElem in currentset:
subSet = removeStr(currentset, frequentElem)
confidence = frequentPatterns[frequentset] / frequentPatterns[subSet]
if (confidence >= minConf):
flag = False
for rule in rules:
if (rule[0] == subSet and rule[1] == frequentset - subSet):
flag = True if (flag == False):
rules.append((subSet, frequentset - subSet, confidence)) if (len(subSet) >= 2):
getRules(frequentset, subSet, rules, frequentPatterns, minConf) if __name__ == '__main__':
dataSet = loadDataSet()
frozenDataSet = transfer2FrozenDataSet(dataSet)
minSupport = 3
fptree, headPointTable = createFPTree(frozenDataSet, minSupport)
frequentPatterns = {}
prefix = set([])
mineFPTree(headPointTable, prefix, frequentPatterns, minSupport)
print("frequent patterns:")
pprint.pprint(frequentPatterns) minConf = 0.6
rules = []
rulesGenerator(frequentPatterns, minConf, rules)
print("association rules:")
pprint.pprint(rules)
print('rules num:', len(rules))
fp-growth代码问题(Python)的更多相关文章
- FP—Growth算法
FP_growth算法是韩家炜老师在2000年提出的关联分析算法,该算法和Apriori算法最大的不同有两点: 第一,不产生候选集,第二,只需要两次遍历数据库,大大提高了效率,用31646条测试记录, ...
- FP - growth 发现频繁项集
FP - growth是一种比Apriori更高效的发现频繁项集的方法.FP是frequent pattern的简称,即常在一块儿出现的元素项的集合的模型.通过将数据集存储在一个特定的FP树上,然后发 ...
- 编写高质量代码--改善python程序的建议(六)
原文发表在我的博客主页,转载请注明出处! 建议二十八:区别对待可变对象和不可变对象 python中一切皆对象,每一个对象都有一个唯一的标识符(id()).类型(type())以及值,对象根据其值能否修 ...
- 编写高质量代码--改善python程序的建议(八)
原文发表在我的博客主页,转载请注明出处! 建议四十一:一般情况下使用ElementTree解析XML python中解析XML文件最广为人知的两个模块是xml.dom.minidom和xml.sax, ...
- 编写高质量代码--改善python程序的建议(三)
原文发表在我的博客主页,转载请注明出处! 建议十三:警惕eval()的安全漏洞 相信经常处理文本数据的同学对eval()一定是欲罢不能,他的使用非常简单: eval("1+1==2" ...
- 抓取oschina上面的代码分享python块区下的 标题和对应URL
# -*- coding=utf-8 -*- import requests,re from lxml import etree import sys reload(sys) sys.setdefau ...
- Frequent Pattern 挖掘之二(FP Growth算法)(转)
FP树构造 FP Growth算法利用了巧妙的数据结构,大大降低了Aproir挖掘算法的代价,他不需要不断得生成候选项目队列和不断得扫描整个数据库进行比对.为了达到这样的效果,它采用了一种简洁的数据结 ...
- 编写高质量代码改善python程序91个建议学习01
编写高质量代码改善python程序91个建议学习 第一章 建议1:理解pythonic的相关概念 狭隘的理解:它是高级动态的脚本编程语言,拥有很多强大的库,是解释从上往下执行的 特点: 美胜丑,显胜隐 ...
- 关联规则算法之FP growth算法
FP树构造 FP Growth算法利用了巧妙的数据结构,大大降低了Aproir挖掘算法的代价,他不需要不断得生成候选项目队列和不断得扫描整个数据库进行比对.为了达到这样的效果,它采用了一种简洁的数据结 ...
- Frequent Pattern (FP Growth算法)
FP树构造 FP Growth算法利用了巧妙的数据结构,大大降低了Aproir挖掘算法的代价,他不需要不断得生成候选项目队列和不断得扫描整个数据库进行比对.为了达 到这样的效果,它采用了一种简洁的数据 ...
随机推荐
- pyhanlp的安装
github 的官方地址:https://github.com/hankcs/pyhanlp conda install -c conda-forge jpype1 pip install pyhan ...
- vue父组件引用多个相同的子组件传值
没有什么问题是for 解决不了的,我一直深信这句话,当然这句话也是我说的 父组件引用多个相同的子组件传值问题 (这种情况很少遇到) 1 <template> 2 <div> 3 ...
- 《2017年-2018年中国MES软件及服务市场研究报告》正式发布!
<2017年-2018年中国MES软件及服务市场研究报告>由e-works Research研究编写,报告深度分析了2017年及2018年中国MES市场发展状况,从市场规模.市场特点.需求 ...
- CTF-代码审计(3)..实验吧——你真的会PHP吗
连接:http://ctf5.shiyanbar.com/web/PHP/index.php 根据题目应该就是代码审计得题,进去就是 日常工具扫一下,御剑和dirsearch.py 无果 抓包,发现返 ...
- Linux shell if条件判断1
shell 逻辑控制语句: 分支判断结构 if case 循环结构 for while unt ...
- Linux使用pt-archiver工具自动备份MySQL
操作系统: CentOS 6.9 脚本语言: shell https://github.com/iscongyang/Practical/blob/master/shell-scripts/pt-ar ...
- 洛谷P1192-台阶问题(线性递推 扩展斐波那契)
占坑 先贴上AC代码 回头来补坑 #include <iostream> using namespace std; int n, k; const int mod = 100003; lo ...
- jvm堆、栈、String常量池
原文地址:http://blog.csdn.net/tophawk/article/details/78704074 程序计数器:它的生命周期与线程相同,线程私有.较小的内存区域,用以完成分支.循环. ...
- linux command lynx
[Purpose] Learning linux command lynx [Eevironment] Ubuntu 16.04 terminal apt-get ...
- 使用flow提升js代码的健壮性
https://www.jianshu.com/p/7716dc8b2206 Flow基本语法及使用 https://www.cnblogs.com/tianxiangbing/p/flow.h ...