Compared with machine learning, the Apriori association-rule algorithm leans more toward data mining.

1) Calling Weka's Apriori association-rule algorithm from a test program, as follows:

try {
    // Needed imports: java.io.File, weka.core.Instances, weka.core.converters.ArffLoader,
    // weka.filters.Filter, weka.filters.unsupervised.attribute.Discretize, weka.associations.Apriori
    File file = new File("F:\\tools/lib/data/contact-lenses.arff");
    ArffLoader loader = new ArffLoader();
    loader.setFile(file);
    Instances m_instances = loader.getDataSet();

    // discretize the attributes before mining
    Discretize discretize = new Discretize();
    discretize.setInputFormat(m_instances);
    m_instances = Filter.useFilter(m_instances, discretize);

    // build the Apriori model and print the resulting itemsets/rules
    Apriori apriori = new Apriori();
    apriori.buildAssociations(m_instances);
    System.out.println(apriori.toString());
} catch (Exception e) {
    e.printStackTrace();
}

Steps

1 Read the data file and extract the sample set (Instances)

2 Discretize the attributes with the Discretize filter

3 Build the Apriori association-rule model

4 Output the large (frequent) itemsets and the association-rule set (a sketch for printing the itemsets as well follows below)
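As a complement to step 4: by default toString() only reports the sizes of the large itemsets and the best rules; listing the itemsets themselves requires turning on the corresponding option first. A minimal sketch, assuming setOutputItemSets is the standard Weka setter for the outputItemSets parameter described in Note 1 and reusing the m_instances object loaded above (verify against your Weka version):

Apriori apriori = new Apriori();
apriori.setOutputItemSets(true);        // also list the large itemsets in the output
apriori.buildAssociations(m_instances);
System.out.println(apriori.toString()); // itemsets followed by the best rules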

2) When the Apriori associator is created, the following method is called to set the default parameters:

  public void resetOptions() {

    m_removeMissingCols = false;
    m_verbose = false;
    m_delta = 0.05;
    m_minMetric = 0.90;
    m_numRules = 10;
    m_lowerBoundMinSupport = 0.1;
    m_upperBoundMinSupport = 1.0;
    m_significanceLevel = -1;
    m_outputItemSets = false;
    m_car = false;
    m_classIndex = -1;
  }

A detailed explanation of these parameters is given in Note 1 at the end.
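If the defaults are not suitable, they can be overridden before buildAssociations is called. A minimal sketch using the bean-style setters that correspond to the fields above (setNumRules, setMinMetric, setDelta, setLowerBoundMinSupport and setUpperBoundMinSupport are the standard Weka Apriori setters; the values shown are only examples, and m_instances is the sample set loaded in section 1):

Apriori apriori = new Apriori();       // resetOptions() has set the defaults
apriori.setNumRules(20);               // m_numRules, default 10
apriori.setMinMetric(0.8);             // m_minMetric, default 0.90
apriori.setDelta(0.05);                // m_delta
apriori.setLowerBoundMinSupport(0.1);  // m_lowerBoundMinSupport
apriori.setUpperBoundMinSupport(1.0);  // m_upperBoundMinSupport
apriori.buildAssociations(m_instances);
System.out.println(apriori);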

3) Analysis of the buildAssociations method; the source code is as follows:

public void buildAssociations(Instances instances) throws Exception {

    double[] confidences, supports;
    int[] indices;
    FastVector[] sortedRuleSet;
    int necSupport = 0;

    instances = new Instances(instances);

    if (m_removeMissingCols) {
      instances = removeMissingColumns(instances);
    }
    if (m_car && m_metricType != CONFIDENCE)
      throw new Exception("For CAR-Mining metric type has to be confidence!");

    // only set class index if CAR is requested
    if (m_car) {
      if (m_classIndex == -1) {
        instances.setClassIndex(instances.numAttributes() - 1);
      } else if (m_classIndex <= instances.numAttributes() && m_classIndex > 0) {
        instances.setClassIndex(m_classIndex - 1);
      } else {
        throw new Exception("Invalid class index.");
      }
    }

    // can associator handle the data?
    getCapabilities().testWithFail(instances);

    m_cycles = 0;

    // make sure that the lower bound is equal to at least one instance
    double lowerBoundMinSupportToUse = (m_lowerBoundMinSupport
        * instances.numInstances() < 1.0) ? 1.0 / instances.numInstances()
        : m_lowerBoundMinSupport;

    if (m_car) {
      // m_instances does not contain the class attribute
      m_instances = LabeledItemSet.divide(instances, false);

      // m_onlyClass contains only the class attribute
      m_onlyClass = LabeledItemSet.divide(instances, true);
    } else
      m_instances = instances;

    if (m_car && m_numRules == Integer.MAX_VALUE) {
      // Set desired minimum support
      m_minSupport = lowerBoundMinSupportToUse;
    } else {
      // Decrease minimum support until desired number of rules found.
      m_minSupport = m_upperBoundMinSupport - m_delta;
      m_minSupport = (m_minSupport < lowerBoundMinSupportToUse) ? lowerBoundMinSupportToUse
          : m_minSupport;
    }

    do {
      // Reserve space for variables
      m_Ls = new FastVector();
      m_hashtables = new FastVector();
      m_allTheRules = new FastVector[6];
      m_allTheRules[0] = new FastVector();
      m_allTheRules[1] = new FastVector();
      m_allTheRules[2] = new FastVector();
      if (m_metricType != CONFIDENCE || m_significanceLevel != -1) {
        m_allTheRules[3] = new FastVector();
        m_allTheRules[4] = new FastVector();
        m_allTheRules[5] = new FastVector();
      }
      sortedRuleSet = new FastVector[6];
      sortedRuleSet[0] = new FastVector();
      sortedRuleSet[1] = new FastVector();
      sortedRuleSet[2] = new FastVector();
      if (m_metricType != CONFIDENCE || m_significanceLevel != -1) {
        sortedRuleSet[3] = new FastVector();
        sortedRuleSet[4] = new FastVector();
        sortedRuleSet[5] = new FastVector();
      }
      if (!m_car) {
        // Find large itemsets and rules
        findLargeItemSets();
        if (m_significanceLevel != -1 || m_metricType != CONFIDENCE)
          findRulesBruteForce();
        else
          findRulesQuickly();
      } else {
        findLargeCarItemSets();
        findCarRulesQuickly();
      }

      // prune rules for upper bound min support
      if (m_upperBoundMinSupport < 1.0) {
        pruneRulesForUpperBoundSupport();
      }

      int j = m_allTheRules[2].size() - 1;
      supports = new double[m_allTheRules[2].size()];
      for (int i = 0; i < (j + 1); i++)
        supports[j - i] = ((double) ((ItemSet) m_allTheRules[1]
            .elementAt(j - i)).support()) * (-1);
      indices = Utils.stableSort(supports);
      for (int i = 0; i < (j + 1); i++) {
        sortedRuleSet[0].addElement(m_allTheRules[0].elementAt(indices[j - i]));
        sortedRuleSet[1].addElement(m_allTheRules[1].elementAt(indices[j - i]));
        sortedRuleSet[2].addElement(m_allTheRules[2].elementAt(indices[j - i]));
        if (m_metricType != CONFIDENCE || m_significanceLevel != -1) {
          sortedRuleSet[3].addElement(m_allTheRules[3]
              .elementAt(indices[j - i]));
          sortedRuleSet[4].addElement(m_allTheRules[4]
              .elementAt(indices[j - i]));
          sortedRuleSet[5].addElement(m_allTheRules[5]
              .elementAt(indices[j - i]));
        }
      }

      // Sort rules according to their confidence
      m_allTheRules[0].removeAllElements();
      m_allTheRules[1].removeAllElements();
      m_allTheRules[2].removeAllElements();
      if (m_metricType != CONFIDENCE || m_significanceLevel != -1) {
        m_allTheRules[3].removeAllElements();
        m_allTheRules[4].removeAllElements();
        m_allTheRules[5].removeAllElements();
      }
      confidences = new double[sortedRuleSet[2].size()];
      int sortType = 2 + m_metricType;

      for (int i = 0; i < sortedRuleSet[2].size(); i++)
        confidences[i] = ((Double) sortedRuleSet[sortType].elementAt(i))
            .doubleValue();
      indices = Utils.stableSort(confidences);
      for (int i = sortedRuleSet[0].size() - 1; (i >= (sortedRuleSet[0].size() - m_numRules))
          && (i >= 0); i--) {
        m_allTheRules[0].addElement(sortedRuleSet[0].elementAt(indices[i]));
        m_allTheRules[1].addElement(sortedRuleSet[1].elementAt(indices[i]));
        m_allTheRules[2].addElement(sortedRuleSet[2].elementAt(indices[i]));
        if (m_metricType != CONFIDENCE || m_significanceLevel != -1) {
          m_allTheRules[3].addElement(sortedRuleSet[3].elementAt(indices[i]));
          m_allTheRules[4].addElement(sortedRuleSet[4].elementAt(indices[i]));
          m_allTheRules[5].addElement(sortedRuleSet[5].elementAt(indices[i]));
        }
      }

      if (m_verbose) {
        if (m_Ls.size() > 1) {
          System.out.println(toString());
        }
      }

      if (m_minSupport == lowerBoundMinSupportToUse
          || m_minSupport - m_delta > lowerBoundMinSupportToUse)
        m_minSupport -= m_delta;
      else
        m_minSupport = lowerBoundMinSupportToUse;

      necSupport = Math.round((float) ((m_minSupport * m_instances
          .numInstances()) + 0.5));

      m_cycles++;
    } while ((m_allTheRules[2].size() < m_numRules)
        && (Utils.grOrEq(m_minSupport, lowerBoundMinSupportToUse))
        /* (necSupport >= lowerBoundNumInstancesSupport) */
        /* (Utils.grOrEq(m_minSupport, m_lowerBoundMinSupport)) */&& (necSupport >= 1));
    m_minSupport += m_delta;
}

Analysis of the main steps:

1 The removeMissingColumns method removes columns whose values are all missing.

2 If the parameter m_car is true, the instances are divided: m_car means that class association rules (rules involving the class attribute) are mined, so the data is split into two parts, the class attribute and the remaining attributes, and rules unrelated to the class are dropped.

3 The findLargeItemSets method finds the large (frequent) itemsets; its source is given below.

4 The findRulesQuickly method finds all the association rules.

5 The pruneRulesForUpperBoundSupport method prunes rules that violate the upper bound on minimum support.

6 The rule set is sorted by confidence.
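To make the support-lowering loop behind these steps concrete, here is a small standalone illustration (not the Weka code itself) of the schedule of minimum-support values tried by the do/while loop, assuming the default parameters delta = 0.05, upperBoundMinSupport = 1.0 and lowerBoundMinSupport = 0.1:

public class SupportSchedule {
    public static void main(String[] args) {
        double delta = 0.05, upper = 1.0, lower = 0.1;
        int cycle = 0;
        // start one delta below the upper bound and decrease until the lower bound
        for (double s = upper - delta; s >= lower - 1e-9; s -= delta) {
            cycle++;
            System.out.printf("cycle %2d: minSupport = %.2f%n", cycle, s);
            // Weka stops early as soon as numRules rules meeting minMetric are found
        }
        // For contact-lenses the run in Note 2 stops at minSupport = 0.20 after 16 cycles.
    }
}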

4) Source of findLargeItemSets, the method that finds the large (frequent) itemsets:

private void findLargeItemSets() throws Exception {

    FastVector kMinusOneSets, kSets;
    Hashtable hashtable;
    int necSupport, necMaxSupport, i = 0;

    // Find large itemsets

    // minimum support
    necSupport = (int) (m_minSupport * m_instances.numInstances() + 0.5);
    necMaxSupport = (int) (m_upperBoundMinSupport * m_instances.numInstances() + 0.5);

    kSets = AprioriItemSet.singletons(m_instances);
    AprioriItemSet.upDateCounters(kSets, m_instances);
    kSets = AprioriItemSet.deleteItemSets(kSets, necSupport,
        m_instances.numInstances());
    if (kSets.size() == 0)
      return;
    do {
      m_Ls.addElement(kSets);
      kMinusOneSets = kSets;
      kSets = AprioriItemSet.mergeAllItemSets(kMinusOneSets, i,
          m_instances.numInstances());
      hashtable = AprioriItemSet.getHashtable(kMinusOneSets,
          kMinusOneSets.size());
      m_hashtables.addElement(hashtable);
      kSets = AprioriItemSet.pruneItemSets(kSets, hashtable);
      AprioriItemSet.upDateCounters(kSets, m_instances);
      kSets = AprioriItemSet.deleteItemSets(kSets, necSupport,
          m_instances.numInstances());
      i++;
    } while (kSets.size() > 0);
}

Main steps:

1 The AprioriItemSet.singletons method converts the header information of the given data set into a set of single-item itemsets; the values in the header are kept in dictionary order.

2 The upDateCounters method counts the support of each candidate, yielding the frequent 1-itemsets once the low-support candidates are removed;

3 The AprioriItemSet.deleteItemSets method deletes itemsets that do not satisfy the required support interval;

4 The mergeAllItemSets method (source below) repeatedly generates the frequent k-itemsets from the (k-1)-itemsets, again using deleteItemSets to drop itemsets that do not satisfy the support interval (a small worked example of the support thresholds follows this list);
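For intuition, a small sketch of the absolute support counts computed at the top of findLargeItemSets. The instance count 24 is that of the standard contact-lenses.arff data set, and 0.2 is the minimum support reached in the sample run shown in Note 2:

// Sketch: relative thresholds turned into absolute instance counts.
int numInstances = 24;              // contact-lenses.arff
double minSupport = 0.2;            // final m_minSupport in the sample run
double upperBoundMinSupport = 1.0;  // default upper bound

int necSupport = (int) (minSupport * numInstances + 0.5);              // = 5 instances
int necMaxSupport = (int) (upperBoundMinSupport * numInstances + 0.5); // = 24 instances
// An itemset counts as "large" here if it occurs in at least 5 of the 24 instances.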

5) Source of mergeAllItemSets, which generates the frequent k-itemsets from the (k-1)-itemsets in a loop:

public static FastVector mergeAllItemSets(FastVector itemSets, int size,
      int totalTrans) {

    FastVector newVector = new FastVector();
    ItemSet result;
    int numFound, k;

    for (int i = 0; i < itemSets.size(); i++) {
      ItemSet first = (ItemSet) itemSets.elementAt(i);
      out: for (int j = i + 1; j < itemSets.size(); j++) {
        ItemSet second = (ItemSet) itemSets.elementAt(j);
        result = new AprioriItemSet(totalTrans);
        result.m_items = new int[first.m_items.length];

        // Find and copy common prefix of size 'size'
        numFound = 0;
        k = 0;
        while (numFound < size) {
          if (first.m_items[k] == second.m_items[k]) {
            if (first.m_items[k] != -1)
              numFound++;
            result.m_items[k] = first.m_items[k];
          } else
            break out;
          k++;
        }

        // Check difference
        while (k < first.m_items.length) {
          if ((first.m_items[k] != -1) && (second.m_items[k] != -1))
            break;
          else {
            if (first.m_items[k] != -1)
              result.m_items[k] = first.m_items[k];
            else
              result.m_items[k] = second.m_items[k];
          }
          k++;
        }
        if (k == first.m_items.length) {
          result.m_counter = 0;
          newVector.addElement(result);
        }
      }
    }
    return newVector;
}

The generateRules method is then called to generate the association rules.

6) Source of generateRules, the method that generates the association rules:

public FastVector[] generateRules(double minConfidence,
      FastVector hashtables, int numItemsInSet) {

    FastVector premises = new FastVector(), consequences = new FastVector(),
        conf = new FastVector();
    FastVector[] rules = new FastVector[3], moreResults;
    AprioriItemSet premise, consequence;
    Hashtable hashtable = (Hashtable) hashtables.elementAt(numItemsInSet - 2);

    // Generate all rules with one item in the consequence.
    for (int i = 0; i < m_items.length; i++)
      if (m_items[i] != -1) {
        premise = new AprioriItemSet(m_totalTransactions);
        consequence = new AprioriItemSet(m_totalTransactions);
        premise.m_items = new int[m_items.length];
        consequence.m_items = new int[m_items.length];
        consequence.m_counter = m_counter;

        for (int j = 0; j < m_items.length; j++)
          consequence.m_items[j] = -1;
        System.arraycopy(m_items, 0, premise.m_items, 0, m_items.length);
        premise.m_items[i] = -1;

        consequence.m_items[i] = m_items[i];
        premise.m_counter = ((Integer) hashtable.get(premise)).intValue();
        premises.addElement(premise);
        consequences.addElement(consequence);
        conf.addElement(new Double(confidenceForRule(premise, consequence)));
      }
    rules[0] = premises;
    rules[1] = consequences;
    rules[2] = conf;
    pruneRules(rules, minConfidence);

    // Generate all the other rules
    moreResults = moreComplexRules(rules, numItemsInSet, 1, minConfidence,
        hashtables);
    if (moreResults != null)
      for (int i = 0; i < moreResults[0].size(); i++) {
        rules[0].addElement(moreResults[0].elementAt(i));
        rules[1].addElement(moreResults[1].elementAt(i));
        rules[2].addElement(moreResults[2].elementAt(i));
      }
    return rules;
}
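The per-rule metric here comes from confidenceForRule, which is essentially support(premise plus consequence) divided by support(premise). A minimal sketch of that calculation (an illustration only, not the Weka source; the counter names mirror the snippet above):

// Sketch: confidence of a rule "premise ==> consequence", computed from support counts.
static double confidence(int premiseCount, int fullItemSetCount) {
    // consequence.m_counter above is set to m_counter, the support count of the whole
    // itemset (premise plus consequence); premise.m_counter is looked up in the
    // hashtable of (k-1)-itemsets.
    return (double) fullItemSetCount / (double) premiseCount;
}

// Example from the output in Note 2: tear-prod-rate=reduced ==> contact-lenses=none,
// premise in 12 instances, premise plus consequence in 12 instances, so conf = 12/12 = 1.0.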

A few remarks

1) If you do not want items whose value is 0 to appear in the rules, you can set them to the missing value "?": the algorithm automatically drops missing-value columns, so they do not take part in rule generation (a sketch follows after this list);

2) Sorting the rules by confidence is only needed by the class-association-rule classifier; if you just want to extract association rules, the sorting step is not required.
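A minimal sketch of remark 1, marking zero values of an attribute as missing before mining. Instance.setMissing, Instance.value and Instances.instance are standard Weka API; the attribute index attrIdx is a hypothetical placeholder, and m_instances is the sample set from section 1:

// Sketch: treat 0 values of attribute attrIdx as missing ("?" in ARFF terms).
int attrIdx = 0; // hypothetical attribute index
for (int i = 0; i < m_instances.numInstances(); i++) {
    if (m_instances.instance(i).value(attrIdx) == 0.0) {
        m_instances.instance(i).setMissing(attrIdx);
    }
}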

Notes

1) Detailed explanation of the parameters of Weka's association-rule mining

1. car: If set to true, class association rules are mined instead of general association rules; that is, only rules whose consequent is the class label are kept (the class attribute is chosen via classIndex, -1 meaning the last attribute).
2. classIndex: Index of the class attribute. If set to -1, the last attribute is treated as the class attribute.
3. delta: The minimum support is repeatedly decreased by this amount until the lower bound for minimum support is reached or the required number of rules has been generated.
4. lowerBoundMinSupport: Lower bound for the minimum support.
5. metricType: The metric used to rank the rules. It can be confidence (class association rules can only be mined with confidence), lift, leverage or conviction.
Weka provides several confidence-like metrics to measure how strongly a rule is associated (a small sketch computing them follows this list):
a) Lift: P(A,B)/(P(A)P(B)). Lift = 1 means A and B are independent; the larger this value (> 1), the less likely it is that A and B appear in the same basket by chance, i.e. the stronger the association.
b) Leverage: P(A,B) - P(A)P(B). Leverage = 0 means A and B are independent; the larger the leverage, the closer the relationship between A and B.
c) Conviction: P(A)P(!B)/P(A,!B), where !B means B does not occur. Conviction also measures the dependence between A and B; from its relationship to lift (negate B, substitute into the lift formula and take the reciprocal), the larger this value, the stronger the association between A and B.
6. minMetric: Minimum value of the chosen metric.
7. numRules: Number of rules to find.
8. outputItemSets: If set to true, the itemsets are also shown in the output.
9. removeAllMissingCols: Remove columns whose values are all missing.
10. significanceLevel: Significance level for the significance test (confidence metric only).
11. upperBoundMinSupport: Upper bound for the minimum support; the iteration starts from this value and decreases the minimum support.
12. verbose: If set to true, the algorithm runs in verbose mode.
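As a reference for the formulas above, a small self-contained sketch that computes the three metrics for a rule A ==> B from the joint and marginal probabilities (illustration only; Weka derives these quantities from itemset support counts internally):

// Sketch: rule metrics for A ==> B, given P(A), P(B) and P(A,B).
static double lift(double pA, double pB, double pAB) {
    return pAB / (pA * pB);        // 1.0 means A and B are independent
}
static double leverage(double pA, double pB, double pAB) {
    return pAB - pA * pB;          // 0.0 means A and B are independent
}
static double conviction(double pA, double pB, double pAB) {
    double pNotB = 1.0 - pB;
    double pAnotB = pA - pAB;      // P(A, !B)
    // for a confidence-1 rule P(A,!B) = 0 and conviction is infinite
    return (pA * pNotB) / pAnotB;  // larger means stronger association
}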

2) Console output

Apriori
=======

Minimum support: 0.2 (5 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 16

Generated sets of large itemsets:

Size of set of large itemsets L(1): 11

Size of set of large itemsets L(2): 21

Size of set of large itemsets L(3): 6

Best rules found:

 1. tear-prod-rate=reduced 12 ==> contact-lenses=none 12    conf:(1)
 2. spectacle-prescrip=myope tear-prod-rate=reduced 6 ==> contact-lenses=none 6    conf:(1)
 3. spectacle-prescrip=hypermetrope tear-prod-rate=reduced 6 ==> contact-lenses=none 6    conf:(1)
 4. astigmatism=no tear-prod-rate=reduced 6 ==> contact-lenses=none 6    conf:(1)
 5. astigmatism=yes tear-prod-rate=reduced 6 ==> contact-lenses=none 6    conf:(1)
 6. contact-lenses=soft 5 ==> astigmatism=no 5    conf:(1)
 7. contact-lenses=soft 5 ==> tear-prod-rate=normal 5    conf:(1)
 8. tear-prod-rate=normal contact-lenses=soft 5 ==> astigmatism=no 5    conf:(1)
 9. astigmatism=no contact-lenses=soft 5 ==> tear-prod-rate=normal 5    conf:(1)
10. contact-lenses=soft 5 ==> astigmatism=no tear-prod-rate=normal 5    conf:(1)

Please credit the source when reposting: http://www.cnblogs.com/rongyux/
