数据挖掘算法以及其实现zz
实验一 分类技术及其应用
实习要求: 基于线性回归模型拟合一个班学生的学习成绩,建立预测模型。数据可由自己建立100个学生的学习成绩。
1) 算法思想:
最小二乘法
设经验方程是y=F(x),方程中含有一些待定系数an,给出真实值{(xi,yi)|i=1,2,...n},将这些x,y值 代入方程然后作差,可以描述误差:yi-F(xi),为了考虑整体的误差,可以取平方和,之所以要平方是考虑到误差可正可负直接相加可以相互抵消,所以记 误差为:
e=∑(yi-F(xi))^2
它是一个多元函数,有an共n个未知量,现在要求的是最小值。所以必然满足对各变量的偏导等于0,于是得到n个方程:
de/da1=0
de/da2=0
...
de/dan=0
n个方程确定n个未知量为常量是理论上可以解出来的。用这种误差分析的方法进行回归方程的方法就是最小二乘法。
线性回归
如果经验方程是线性的,形如y=ax+b,就是线性回归。按上面的分析,误差函数为:
e=∑(yi-axi-b)^2
各偏导为:
de/da=2∑(yi-axi-b)xi=0
de/db=-2∑(yi-axi-b)=0
于是得到关于a,b的线性方程组:
(∑xi^2)a+(∑xi)b=∑yixi
(∑xi)a+nb=∑yi
设A=∑xi^2,B=∑xi,C=∑yixi,D=∑yi,则方程化为:
Aa+Bb=C
Ba+nb=D
解出a,b得:
a=(Cn-BD)/(An-BB)
b=(AD-CB)/(An-BB)
2) 编程实现算法
C++程序:
#include<iostream>
#include<math.h>
using namespace std;
void main()
{
double x,y,A=0.0,B=0.0,C=0.0,D=0.0,delta,a,b;
int n,sno,avgstudy;
cout<<"请拟合输入样本数目"<<endl;
cin>>n;
for(int i=0;i<n;++i)
{ cout<<"请输入第"<<i+1<<"个学生学号"<<endl;
cin>>sno;
cout<<"请输入学生上自习时间,按照每天小时计算"<<endl;
cin>>x;
cout<<"请输入学生请输入平均成绩"<<endl;
cin>>y;
A+=x*x;
B+=x;
C+=x*y;
D+=y;
}
delta=A*n-B*B;
a=((C*n-B*D)/delta);
b=((A*D-C*B)/delta);
cout<<"a="<<a<<"b="<<b<<endl;
if(fabs(delta)<1e-10)
{
cerr<<"Error!Divide by zero"<<endl;
}
else
{
cout<<"a="<<((C*n-B*D)/delta)<<endl
<<"b="<<((A*D-C*B)/delta)<<endl;
}
cout<<"输入您想预测的成绩,先输入平均日自习时间(小时)"<<endl;
cin>>avgstudy;
cout<<a*avgstudy+b;
}
}
3) 输出运算结果
输入是将各个同学的上自习的时间 按照小时计算
比如(4,85)(5,94),将成绩和上自习时间进行相应的线性回归
,推导出相应的线型方程,以便今后对其他学生上自习以及成绩的估测。
实习二 聚类技术及其应用
实习题1 编程验证单连接凝聚聚类算法,实验数据可使用第五章表5.2 的数据进行。要求输出层次聚类过程中每一步的聚类结果。
实习题2 利用K-均值聚类算法对如下数据进行聚类,其中输入K=3,数据集为
{
2,4,10,12,3,20,30,11,25,23,34,22} 。
要求输出每个类及其中的元素。
1)算法基本思想的描述
Given
k, the k-means algorithm is implemented in four steps:
– Partition objects into k nonempty subsets
– Compute seed points as the centroids of the clusters of the current partition (the centroid is the
center, i.e., mean point, of the cluster)
– Assign each object to the cluster with the nearest seed point
– Go back to Step 2, stop when no more new assignment
2)编程实现算法
//***********引入库函数
#include "iostream.h"
#include "math.h"
#include "stdlib.h"
#include "iomanip.h"
#include "time.h"
#include "fstream.h"
//*************定义常量
const int TRUE=1;
const int FALSE=0;
const int MarkovLengh=10000;
const int MaxInnerLoop=10000;
const int MaxOuterLoop=60;
const double CO=0.1;
const double DeclineRate=0.95;
const long MAX=100000;
const int AcceptRate=1;
const double ForceDecline=0.9;
//************定义全局变量
int DataNum; //聚类样本数目
int Dimension; //样本维数
int K; //分类数
double *DataSet; //指向浮点型的指针
int HALT=0;
int Row=3;
//***************************************************************
// 类GETDATA:设定全局变量,维数,样本数,和类别数等 ***
// 随机生成样本或手工输入样本的类 ***
//***************************************************************
class GETDATA{
public:
GETDATA();
void
Display();
void
Initial();
void
Input();
double
FRand(double,double);
double
rand1,rand2; //随机数的高低值
};
GETDATA::GETDATA()
{
int
i,j;
Dimension=2;
DataNum=50;
K=4;
DataSet=new
double[Dimension*DataNum];
for(i=0;i<DataNum;i++)
{
for(j=0;j<Dimension;j++)
DataSet[i*Dimension+j]=(((double)rand()/(double)RAND_MAX)*100);
}
}
//*****************显示当前待聚类的样本(维数,个数,类别数等)
void GETDATA::Display()
{
int
i,j;
cout<<" 当前样本集如下:"<<endl<<" {"<<endl;
for(i=0;i<DataNum;i++)
{
cout<<" [";
for(j=0;j<Dimension;j++)
{
cout<<"
"<<setw(8)<<DataSet[i*Dimension+j];
}
cout<<" ] ";
if((i+1)%Row==0)
cout<<endl;
}
cout<<endl<<" }"<<endl;
cout<<endl<<" 以上实数样本集由计算机在---100之间随机产,其中:"<<endl;
cout<<endl<<" 样本维数Dimension= "<<Dimension<<endl;
cout<<" 样本数 DataNum= "<<DataNum<<endl;
cout<<" 类别数 K= "<<K<<endl;
}
//****************输入待聚类样本,包括维数,个数,类别数等
void GETDATA::Input()
{
char
flag;
int
i,j;
double
s=0;
cout<<endl<<" 请依次输入: 维数 样本数目 类别数"<<endl;
cout<<endl<<" 维数Dimension: ";
cin>>Dimension;
cout<<endl<<" 样本数目DataNum: ";
cin>>DataNum;
cout<<endl<<" 类别数K:";
cin>>K;
cout<<endl<<" 随机生成数据输入R 人工输入按B: "<<endl; delete[]DataSet;
DataSet=new
double[Dimension*DataNum];
cin>>flag;
if(flag=='R'||flag=='r')
{
cout<<" 输入随机数生成范围(最小值和最大值):"
<<endl<<" 最小值:";
cin>>rand1;
cout<<endl<<" 最大值:";
cin>>rand2;
for(i=0;i<DataNum;i++)
{
for(j=0;j<Dimension;j++)
DataSet[i*Dimension+j]=FRand(rand1,rand2);
}
}
else
if(flag=='H'||flag=='h')
{
for(i=0;i<DataNum;i++)
{
cout<<endl<<" 请输入第"<<i+1<<" 个样本的"<<Dimension<<" 个分量";
for(j=0;j<Dimension;j++)
cin>>DataSet[i*Dimension+j];
}
}
else
cout<<endl<<" 非法数据!";
}
//****************初始化聚类样本
void GETDATA::Initial()
{
char
ch;
GETDATA::Display();
cout<<endl<<" 重新录入样本输入A 开始聚类B: ";
cin>>ch;
while(!(ch=='A'||ch=='a')&&!(ch=='B'||ch=='b'))
{
cout<<endl<<" 重新录入样本输入A 开始聚类B: ";
cin>>ch;
}
if(ch=='A'||ch=='a')
GETDATA::Input();
}
double GETDATA::FRand(double rand1,double rand2)
{
return
rand1+(double)(((double)rand()/(double)RAND_MAX)*(rand2-rand1));
}
//***********************************************************
// 类SSA:
K-均值算法的实现 ***
// 功能:根据设定的K,DataNum,Dimension等聚类 ***
//***********************************************************
class SAA
{
public:
struct
DataType
{
double
*data;
int
father;
double
*uncle;
};
struct
ClusterType
{
double
*center;
int
sonnum;
};
SAA();
void
Initialize();
void
KMeans();
void
SA( );
void
DisPlay();
void
GetDataset(DataType *p1,double *p2,int datanum,int dim);
void
GetValue(double *str1,double
*str2,int dim);
int FindFather(double
*p,int k);
double
SquareDistance(double *str1,double *str2,int
dim);
int Compare(double
*p1,double *p2,int
dim);
void
NewCenterPlus(ClusterType *p1,int t,double *p2,int dim);
void
NewCenterReduce(ClusterType *p1,int t,double *p2,int dim);
double
MaxFunc();
void
Generate(DataType *p1,ClusterType *c1);
double
Compare(DataType *p1,ClusterType *c1,DataType *p2,ClusterType *c2);
void
CopyStatus(DataType *p1,ClusterType *c1,DataType *p2,ClusterType *c2);
int SecondFather(DataType *p,int t,int k);
double
AimFunction(DataType *q,ClusterType *c);
double
FRand(double ,double);
void
KMeans1();
protected:
double Temp;
//double CO;
//double DeclineRate;
//int MarkovLengh;
//int MaxInnerLoop;
//int MaxOuterLoop;
double AimFunc;
DataType *DataMember, *KResult,*CurrentStatus,*NewStatus;
ClusterType * ClusterMember,*NewCluster,*CurrentCluster;
}; //end of class SAA
//************建立构造函数,初始化保护成员
SAA::SAA()
{
int
i;
// DeclineRate=(double)0.9;
// MarkovLengh=1000;
// MaxInnerLoop=200;
// MaxOuterLoop=10;
// CO=1;
DataMember=new DataType[DataNum];
ClusterMember=new ClusterType[K];
for(i=0;i<DataNum;i++)
{
DataMember[i].data=new double[Dimension];
DataMember[i].uncle=new double[K];
}
for(i=0;i<K;i++)
ClusterMember[i].center=new double[Dimension];
GetDataset(DataMember,DataSet,DataNum,Dimension);
}//endSAA
//****************初始化参数,及开始搜索状态
void SAA::Initialize( )
{
//K-均值聚类法建立退火聚类的初始状态
// KMeans();
}
//*******************k-均值法进行聚类
//************接口:数据,数量,维数,类别
//逐点聚类方式
void SAA::KMeans()
{
int
i,j,M=1;
int
pa,pb,fa;
ClusterType *OldCluster;
//初始化聚类中心
OldCluster=new ClusterType[K];
for(i=0;i<K;i++)
{
// cout<<endl<<i+1<<"中心:";
GetValue(ClusterMember[i].center,DataMember[i].data,Dimension);
ClusterMember[i].sonnum=1;
OldCluster[i].center=new double[Dimension];
GetValue(OldCluster[i].center,ClusterMember[i].center,Dimension);
}
for(i=0;i<DataNum;i++)
{
// cout<<endl<<i+1<<":
"<<ClusterMember[0].center[0]<<"
"<<ClusterMember[1].center[0]<<" son:
"<<ClusterMember[0].sonnum;
for(j=0;j<K;j++)
{
DataMember[i].uncle[j]=SquareDistance(DataMember[i].data,ClusterMember[j].center,Dimension);
// cout<<"
"<<i+1<<"->"<<j+1<<":
"<<DataMember[i].uncle[j];
//"类中心"<<ClusterMember[j].center[0]<<":
"<<DataMember[i].uncle[j]<<" ";
}
pa=DataMember[i].father=FindFather(DataMember[i].uncle,K);
if(i>=K)
{
// cout<<endl<<pa<<"
类样本数:"<<ClusterMember[pa].sonnum;
ClusterMember[pa].sonnum+=1;
// cout<<endl<<pa<<"
类样本数:"<<ClusterMember[pa].sonnum;
NewCenterPlus(ClusterMember,pa,DataMember[i].data,Dimension);
// cout<<endl<<i+1<<"->"<<pa+1<<"类
:"<<ClusterMember[pa].center[0];
GetValue(OldCluster[pa].center,ClusterMember[pa].center,Dimension);
}
}
//开始聚类,直到聚类中心不再发生变化。××逐个修改法××
while(!HALT)
{
//一次聚类循环:.重新归类;.修改类中心
for(i=0;i<DataNum;i++)
{
// cout<<endl;
for(j=0;j<K;j++)
{
// cout<<" D
"<<DataMember[i].data[0]<<"
"<<ClusterMember[j].center[0]<<" ";
DataMember[i].uncle[j]=SquareDistance(DataMember[i].data,ClusterMember[j].center,Dimension);
//
cout<<DataMember[i].data[0]<<"->"<<ClusterMember[0l].center[0]<<" :
"<<DataMember[i].uncle[0]<<endl;
// cout<<i+1<<"->"<<j+1<<"
"<<DataMember[i].uncle[j];
}
fa=DataMember[i].father;
if(fa!=FindFather(DataMember[i].uncle,K)&&ClusterMember[fa].sonnum>1)
{
pa=DataMember[i].father;
ClusterMember[pa].sonnum-=1;
pb=DataMember[i].father=FindFather(DataMember[i].uncle,K);
ClusterMember[pb].sonnum+=1;
NewCenterReduce(ClusterMember,pa,DataMember[i].data,Dimension);
NewCenterPlus(ClusterMember,pb,DataMember[i].data,Dimension);
/* cout<<endl<<"*********************"<<M<<"
次聚类:*****************"; //聚一次类输出一次结果
cout<<endl<<DataMember[i].data[0]<<"
in "<<pa+1<<"类-> "<<pb+1<<"类: ";
for(t=0;t<K;t++)
{
cout<<endl<<"
第"<<t+1
<<"类中心:
"<<ClusterMember[t].center[0]<<" 样本个数:"<<ClusterMember[t].sonnum;
}
DisPlay();
M=M+1;
*/
}
}//endfor
//判断聚类是否完成,HALT=1,停止聚类
HALT=0;
for(j=0;j<K;j++)
if(Compare(OldCluster[j].center,ClusterMember[j].center,Dimension))
break;
if(j==K)
HALT=1;
for(j=0;j<K;j++)
GetValue(OldCluster[j].center,ClusterMember[j].center,Dimension);
}//endwhile
}//end of KMeans
//批聚类方式
void SAA::KMeans1()
{
int
i,j,M=1;
int
pa,pb,fa;
ClusterType *OldCluster;
//初始化聚类中心
OldCluster=new ClusterType[K];
for(i=0;i<K;i++)
OldCluster[i].center=new double[Dimension];
for(j=0;j<K;j++)
GetValue(OldCluster[j].center,ClusterMember[j].center,Dimension);
//开始聚类,直到聚类中心不再发生变化。××逐个修改法××
while(!HALT)
{
//一次聚类循环:.重新归类;.修改类中心
for(i=0;i<DataNum;i++)
{
for(j=0;j<K;j++)
DataMember[i].uncle[j]=SquareDistance(DataMember[i].data,ClusterMember[j].center,Dimension);
fa=DataMember[i].father;
if(fa!=FindFather(DataMember[i].uncle,K)&&ClusterMember[fa].sonnum>1)
{
pa=DataMember[i].father;
ClusterMember[pa].sonnum-=1;
pb=DataMember[i].father=FindFather(DataMember[i].uncle,K);
ClusterMember[pb].sonnum+=1;
NewCenterReduce(ClusterMember,pa,DataMember[i].data,Dimension);
NewCenterPlus(ClusterMember,pb,DataMember[i].data,Dimension);
}
}//endfor
//判断聚类是否完成,HALT=1,停止聚类
HALT=0;
for(j=0;j<K;j++)
if(Compare(OldCluster[j].center,ClusterMember[j].center,Dimension))
break;
if(j==K)
HALT=1;
for(j=0;j<K;j++)
GetValue(OldCluster[j].center,ClusterMember[j].center,Dimension);
}//endwhile
}
//几个经常需要调用的小函数
void SAA::NewCenterPlus(ClusterType *p1,int
t,double *p2,int
dim)
{
int
i;
for(i=0;i<dim;i++)
p1[t].center[i]=p1[t].center[i]+(p2[i]-p1[t].center[i])/(p1[t].sonnum);
}
void SAA::NewCenterReduce(ClusterType *p1,int t,double *p2,int dim)
{
int
i;
for(i=0;i<dim;i++)
p1[t].center[i]=p1[t].center[i]+(p1[t].center[i]-p2[i])/(p1[t].sonnum);
}
void SAA::GetDataset(DataType *p1,double
*p2,int datanum,int
dim)
{
int
i,j;
for(i=0;i<datanum;i++)
{
for(j=0;j<dim;j++)
p1[i].data[j]=p2[i*dim+j];
}
}
void SAA::GetValue(double *str1,double *str2,int dim)
{
int
i;
for(i=0;i<dim;i++)
str1[i]=str2[i];
}
int SAA::FindFather(double *p,int k)
{
int
i,N=0;
double
min=30000;
for(i=0;i<k;i++)
if(p[i]<min)
{
min=p[i];
N=i;
}
return
N;
}
double SAA::SquareDistance(double
*str1,double *str2,int
dim)
{
double
dis=0;
int
i;
for(i=0;i<dim;i++)
dis=dis+(double)(str1[i]-str2[i])*(str1[i]-str2[i]);
return
dis;
}
int SAA::Compare(double *p1,double
*p2,int dim)
{
int
i;
for(i=0;i<dim;i++)
if(p1[i]!=p2[i])
return 1;
return
0;
}
double SAA::FRand(double a,double b)
{
return
a+(double)(((double)rand()/(double)RAND_MAX)*(b-a));
}
void SAA::DisPlay()
{
int
i,N,j,t;
ofstream result("聚类过程结果显示.txt",ios::ate);
for(i=0;i<K;i++)
{
N=0;
cout<<endl<<endl<<"******************** 第"<<i+1<<" 类样本:*******************"<<endl;
result<<endl<<endl<<"******************** 第"<<i+1<<" 类样本:*******************"<<endl;
for(j=0;j<DataNum;j++)
if(DataMember[j].father==i)
{
cout<<" [";
for(t=0;t<Dimension;t++)
cout<<" "<<setw(5)<<DataMember[j].data[t];
cout<<" ] ";
if((N+1)%Row==0)
cout<<endl;
result<<" [";
for(t=0;t<Dimension;t++)
result<<" "<<setw(5)<<DataMember[j].data[t];
result<<" ] ";
if((N+1)%Row==0)
result<<endl;
N=N+1;
}
}//end
for
cout<<endl<<endl<<" 聚类结果,总体误差准则函数:"<<AimFunction(DataMember,ClusterMember)<<endl;
result<<endl<<" 聚类结果,总体误差准则函数:"<<AimFunction(DataMember,ClusterMember)<<endl;
result.close();
}//end of Display
double SAA::AimFunction(DataType *q,ClusterType *c)
{
int
i,j;
double
*p;
p=new
double[K];
for(i=0;i<K;i++)
p[i]=0;
for(i=0;i<K;i++)
{
for(j=0;j<DataNum;j++)
if(q[j].father==i)
{
p[i]=p[i]+SquareDistance(c[i].center,q[j].data,Dimension);
}
}
AimFunc=0;
for(i=0;i<K;i++)
AimFunc=AimFunc+p[i];
return
AimFunc;
}
//************************************
// 主函数入口 ****
//************************************
void main()
{
//用户输入数据
srand((unsigned)time(NULL));
GETDATA getdata;
getdata.Initial();
ofstream file("聚类过程结果显示.txt",ios::trunc); //聚类结果存入“聚类结果显示.txt”文件中
//k-均值聚类方法聚类
SAA saa; //****此行不可与上行互换。
saa.KMeans(); //逐个样本聚类
// saa.KMeans1(); //批处理方式聚类,可以比较saa.KMeans()的区别
cout<<endl<<"***********************K-均值聚类结果:**********************";
file<<endl<<"***********************K-均值聚类结果:**********************"<<endl;
file.close();
saa.DisPlay();
cout<<endl<<" 程序运行结束!"<<endl;
}
}3)输出运算结果
实习三 关联规则挖掘及其应用
实习题:Apriori算法是一种最有影响的挖掘布尔关联规则频繁项集的算法。它将关联规则挖掘算法的设计分解为两个子问题:(1) 找到所有支持度大于最小支持度的项集,这些项集称被为频繁项集(Frequent Itemset)。(2) 使用第一步产生的频繁集产生期望的规则。
在图书馆管理系统中积累了大量的读者借还书的历史记录,基于Apriori算法挖掘最大频繁项目集,由此产生关联规则。数据格式可参阅文献
参考文献:彭仪普,熊拥军: 关联挖掘在文献借阅历史数据分析中的应用.情报杂志. 2005年第8期
1)
算法基本思想的描述
首先产生频繁1-项集L1,然后是频繁2-项集L2,直到有某个r值使得Lr为空,这时算法停止。这里在第k次循环中,过程先产生候选k-项集的集合Ck,Ck中的每一个项集是对两个只有一个项不同的属于Lk-1的频集做一个(k-2)-连接来 产生的。Ck中的项集是用来产生频集的候选集,最后的频集Lk必须是Ck的一个子集。Ck中的每个元素需在交易数据库中进行验证来决定其是否加入Lk,这 里的验证过程是算法性能的一个瓶颈。
为了生成所有频集,使用了递推的方法。其核心思想简要描述如下:
(1) L1 = {large 1-itemsets};
(2) for (k=2; Lk-1¹F; k++) do begin
(3) Ck=apriori-gen(Lk-1); //新的候选集
(4) for all transactions
tÎD do begin
(5) Ct=subset(Ck,t); //事务t中包含的候选集
(6) for all
candidates cÎ Ct do
(7) c.count++;
(8) end
(9) Lk={cÎ Ck
|c.count³minsup}
(10) end
(11) Answer=∪kLk;
1.Find all frequent itemsets: By
definition, each of these itemsets will occur at least as frequently as a predetermined minimum support
count, min sup
2. Generate strong association
rules from the frequent itemsets: By definition, these rules must satisfy minimum support and minimum
confidence
3.Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested!
Method:
– generate length (k+1) candidate itemsets from two length k frequent itemsets which have K-1 kinds same itemsets,
and
– test the candidates against
DB
2)
编程实现算法
1. Item.h 源文件
/*----------------------------------------------------------------------
File : Item.h
Contents : itemset
management
Author : Bart Goethals
Update
: 4/4/2003
----------------------------------------------------------------------*/
#include <set>
using namespace std;
class Item
{
public:
Item(int i) : id(i), support(0), children(0) {}
Item(const Item &i) : id(i.id), support(i.support),
children(i.children) {}
~Item(){}
int
getId() const {return
id;}
int
Increment(int inc = 1) const
{return support+=inc;}
set<Item> *makeChildren() const;
int
deleteChildren() const;
int
getSupport() const {return
support;}
set<Item> *getChildren() const {return
children;}
bool
operator<(const
Item &i) const{return
id < i.id;}
private:
const
int id;
mutable
int support;
mutable
set<Item> *children;
};
2. AprioriRules.h 源文件
class Itemset
{
public:
Itemset(int l) : length(l) {t = new int[l];}
Itemset(const
Itemset &is) : length(is.length), support(is.support)
{
t = new
int[length];
for(int i=0;i<length;i++) t[i] = is.t[i];
}
~Itemset(){delete [] t;}
int
length;
int
*t;
int
support;
};
class AprioriRules
{
public:
AprioriRules();
~AprioriRules();
void
setData(char *fn);
int
setOutputRules(char *fn);
void
setMinConf(float mc){minconf=mc;}
int
generateRules();
void
setMaxHead(int m){maxhead=m;}
void
setVerbose(){verbose=true;}
private:
Itemset *getNextSet();
int
generateRules(set<Item> *current, int
*iset, int depth);
int
processSet(set<Item> *items, int sl, int *iset, int sup, int *head, int spos, int depth);
Item *trie;
float
minconf;
int
maxhead;
ofstream rulesout;
FILE *data;
bool
verbose;
};
3.AprioriRules.cpp源文件
/*----------------------------------------------------------------------
File : AprioriRules.cpp
Contents : apriori
algorithm for finding association rules
Author : Bart Goethals
Update : 16/04/2003
----------------------------------------------------------------------*/
#include <iostream>
#include <fstream>
#include <stdio.h>
#include <set>
#include <vector>
#include <time.h>
using namespace std;
#include "Item.h"
#include "AprioriRules.h"
AprioriRules::AprioriRules()
{
data=0;
minconf=0;
maxhead=0;
trie = new
Item(0);
verbose = false;
}
AprioriRules::~AprioriRules()
{
if(data)
fclose(data);
if(trie)
{
trie->deleteChildren();
delete
trie;
}
}
void AprioriRules::setData(char
*fn)
{
data = fopen(fn,"rt");
}
int AprioriRules::setOutputRules(char *fn)
{
rulesout.open(fn);
if(!rulesout.is_open())
{
cerr << "error: could not open " << fn
<< endl;
return
-1;
}
return
0;
}
Itemset *AprioriRules::getNextSet()
{
Itemset *t;
vector<int>
list;
char
c;
do
{
int
item=0, pos=0;
c = getc(data);
while((c
>= '0') && (c <= '9')) {
item *=10;
item += int(c)-int('0');
c = getc(data);
pos++;
}
if(pos)
list.push_back(item);
}while(c
!= '\n' && !feof(data));
if(feof(data))
return 0;
int
size = list.size() - 1;
if(size>=0)
{
t = new
Itemset(size);
t->support = list[size];
for(int i=0; i<size; i++) t->t[i] = list[i];
return
t;
}
else
return getNextSet();
}
int AprioriRules::generateRules()
{
int
size=0;
clock_t start;
// Read
all frequent itemsets
if(verbose)
cout << "reading frequent itemsets"
<< flush;
start = clock();
while(Itemset
*t = getNextSet()) {
set<Item>::iterator it;
set<Item>* items =
trie->makeChildren();
for(int depth=0;depth < t->length; depth++) {
it =
items->find(Item(t->t[depth]));
if(it
== items->end()) it = items->insert(Item(t->t[depth])).first;
items = it->makeChildren();
}
if(t->length)
it->Increment(t->support);
else
trie->Increment(t->support);
size = (t->length>size?
t->length : size);
delete
t;
}
if(verbose)
cout << "[" <<
(clock()-start)/double(CLOCKS_PER_SEC) <<
"s]" << endl << flush;
//
generate rules
if(verbose)
cout << "generating rules"
<< flush;
int
*iset = new int[size];
int
added = generateRules(trie->getChildren(), iset, 1);
delete
[] iset;
if(verbose)
cout << "[" <<
(clock()-start)/double(CLOCKS_PER_SEC) <<
"s]" << endl << flush;
return
added;
}
int AprioriRules::generateRules(set<Item> *current, int *iset, int depth)
{
if(current==0)
return 0;
int
added = 0;
for(set<Item>::iterator
runner = current->begin(); runner!= current->end(); runner++) {
iset[depth-1] =
runner->getId();
if(depth
> 1) {
int
*tmp = new int[depth];
added +=
processSet(trie->getChildren(), depth, iset, runner->getSupport(), tmp,
0,1);
delete
[] tmp;
}
added +=
generateRules(runner->getChildren(), iset, depth+1);
}
return
added;
}
int AprioriRules::processSet(set<Item> *items, int
sl, int *iset, int
sup, int *head, int
spos, int depth)
{
int
loper = spos;
set<Item>::iterator runner,
it;
int
added=0,i,j,k;
spos = sl;
while(--spos
>= loper) {
head[depth-1] = iset[spos];
runner =
items->find(Item(iset[spos]));
//
find body and its support
set<Item> *tmp =
trie->getChildren();
int
*body = new int[sl-depth];
for(i=j=k=0;
i<sl; i++) {
if(j<depth
&& iset[i]==head[j]) j++;
else
{
it =
tmp->find(Item(iset[i]));
tmp = it->getChildren();
body[k++] = iset[i];
}
}
// float intr =
(float(sup)*float(trie->getSupport()))/(float(runner->getSupport())*float(it->getSupport()));
float
conf = float(sup)/float(it->getSupport());
if(conf>=minconf)
{
for(i=0;
i<sl-depth; i++) rulesout << body[i] << "
";
rulesout << "=> ";
for(i=0;
i<depth; i++) rulesout << head[i] << "
";
rulesout << "(" << sup << ", " << conf << ")" << endl;
// rulesout
<< "(" << sup << ", " << conf
<< ", " << intr << ")" << endl;
added++;
}
delete
[] body;
if(conf>=minconf
&& depth<sl-1) {
if(maxhead)
{
if(depth<maxhead)
added += processSet(runner->getChildren(), sl, iset, sup, head, spos+1,
depth+1);
}
else
added += processSet(runner->getChildren(), sl, iset, sup, head, spos+1,
depth+1);
}
}
return
added;
}
4.Item.cpp
/*----------------------------------------------------------------------
File : Item.cpp
Contents : itemset
management
Author : Bart Goethals
Update : 4/4/2003
----------------------------------------------------------------------*/
#include "Item.h"
set<Item> *Item::makeChildren() const
{
if(children)
return children;
return
children = new set<Item>;
}
int Item::deleteChildren() const
{
int
deleted=0;
if(children)
{
for(set<Item>::iterator
it = children->begin(); it != children->end(); it++)
{
deleted +=
it->deleteChildren();
}
delete
children;
children = 0;
deleted++;
}
return
deleted;
}
5./*----------------------------------------------------------------------
File : aprioritest.cpp
Contents : apriori
algorithm for finding association rules
Author : Bart Goethals
Update : 16/04/2003
----------------------------------------------------------------------*/
#include <stdio.h>
#include <time.h>
#include <iostream>
#include <set>
#include <vector>
#include <fstream>
using namespace std;
#include "Item.h"
#include "AprioriRules.h"
int main(int argc, char
*argv[])
{
cout << "Apriori association rule mining implementation"
<< endl;
cout << "by Bart Goethals, 2000-2003" <<
endl;
cout << "http://www.cs.helsinki.fi/u/goethals/"
<< endl << endl;
if(argc
< 3) {
cerr << "usage: " << argv[0] << " setsfile minconf [output]" <<
endl;
}
else
{
clock_t start = clock();
AprioriRules r;
r.setVerbose();
r.setData(argv[1]);
r.setMinConf(atof(argv[2]));
// r.setMaxHead(1);
if(argc==4)
r.setOutputRules(argv[3]);
start = clock();
cout << "generating rules\n" << flush;
int
rules = r.generateRules();
cout << rules << "\t[" << (clock()-start)/double(CLOCKS_PER_SEC) << "s]" << endl;
if(argc==4)
cout << "Written to "
<< argv[3] << endl;
}
return
0;
}
3)输出运算结果
由于数据方面以及程序的一些问题,本实验的测试结果没能产生。
数据挖掘课程总结:
- 在这门比较新的课程中学到了许多的新概念.
- 学会了查找相关资料,自学能力进一步加强.
- 认识到了自身应用方面的缺陷,在以后的学习中抓紧时间改进.
- 在查阅资料的过程中学到了很多新的算法
- 进一步加强了编程方面的能力,掌握数据挖掘在现实世界中的应用的以及其重要作用
数据挖掘算法以及其实现zz的更多相关文章
- 【十大经典数据挖掘算法】PageRank
[十大经典数据挖掘算法]系列 C4.5 K-Means SVM Apriori EM PageRank AdaBoost kNN Naïve Bayes CART 我特地把PageRank作为[十大经 ...
- 【十大经典数据挖掘算法】EM
[十大经典数据挖掘算法]系列 C4.5 K-Means SVM Apriori EM PageRank AdaBoost kNN Naïve Bayes CART 1. 极大似然 极大似然(Maxim ...
- 【十大经典数据挖掘算法】AdaBoost
[十大经典数据挖掘算法]系列 C4.5 K-Means SVM Apriori EM PageRank AdaBoost kNN Naïve Bayes CART 1. 集成学习 集成学习(ensem ...
- 【十大经典数据挖掘算法】SVM
[十大经典数据挖掘算法]系列 C4.5 K-Means SVM Apriori EM PageRank AdaBoost kNN Naïve Bayes CART SVM(Support Vector ...
- 【十大经典数据挖掘算法】Naïve Bayes
[十大经典数据挖掘算法]系列 C4.5 K-Means SVM Apriori EM PageRank AdaBoost kNN Naïve Bayes CART 朴素贝叶斯(Naïve Bayes) ...
- 【十大经典数据挖掘算法】C4.5
[十大经典数据挖掘算法]系列 C4.5 K-Means SVM Apriori EM PageRank AdaBoost kNN Naïve Bayes CART 1. 决策树模型与学习 决策树(de ...
- 【十大经典数据挖掘算法】k-means
[十大经典数据挖掘算法]系列 C4.5 K-Means SVM Apriori EM PageRank AdaBoost kNN Naïve Bayes CART 1. 引言 k-means与kNN虽 ...
- 【十大经典数据挖掘算法】Apriori
[十大经典数据挖掘算法]系列 C4.5 K-Means SVM Apriori EM PageRank AdaBoost kNN Naïve Bayes CART 1. 关联分析 关联分析是一类非常有 ...
- 【十大经典数据挖掘算法】kNN
[十大经典数据挖掘算法]系列 C4.5 K-Means SVM Apriori EM PageRank AdaBoost kNN Naïve Bayes CART 1. 引言 顶级数据挖掘会议ICDM ...
随机推荐
- graphql cli 开发graphql api flow
作用 代码生成 schema 处理 脚手架应用创建 项目管理 安装cli npm install -g graphql-cli 初始化项目(使用.graphqlconfig管理) 以下为demo de ...
- Python yield详解***
yield的英文单词意思是生产,有时候感到非常困惑,一直没弄明白yield的用法. 只是粗略的知道yield可以用来为一个函数返回值塞数据,比如下面的例子: def addlist(alist): f ...
- webpack和webpack-dev-server安装配置
本文转载自:https://www.cnblogs.com/xuehaoyue/p/6410095.html 跟着Webpack傻瓜式指南(一)这个教程在安装webpack和webpack-dev-s ...
- 走迷宫(用队列bfs并输出走的路径)
#include <iostream> #include <stack> #include <string.h> #include <stdio.h> ...
- HDU 4438 Hunters
Hunters Time Limit: 2000/1000 MS (Java/Others) Memory Limit: 32768/32768 K (Java/Others)Total Sub ...
- [Java.web]JSTL 使用
<%@ page import="cn.itcast.domain.Person"%> <%@ page language="java" im ...
- laravel上传文件到七牛云存储
背景 最近在用PHP和laravel框架做一个图片网站,需要将图片存贮到云端,搜索下了对比了下功能,发现七牛云存储不错(主要小流量免费),便选择使用七牛作为图片存储空间. 要实现的功能很简单,选择本地 ...
- Maven的依赖机制介绍
以下内容引用自https://ayayui.gitbooks.io/tutorialspoint-maven/content/book/maven_manage_dependencies.html: ...
- 存在继承关系的Java类对象之间的类型转换(一)
类似于基本数据类型之间的强制类型转换. 存在继承关系的父类对象和子类对象之间也可以 在一定条件之下相互转换. 这种转换需要遵守以下原则: 1.子类对象可以被视为是其父类的一个对象2.父类对象不能被 ...
- 浅谈Laravel框架的CSRF
前文 CSRF攻击和漏洞的参考文章: http://www.cnblogs.com/hyddd/archive/2009/04/09/1432744.html Laravel默认是开启了CSRF功能, ...