Data Mining: WAP-Tree and PLWAP-Tree

Introduction

We start with the WAP-Tree. The following passage is quoted from the paper "Effective Web Log Mining using WAP Tree-Mine":

Abstract - World Wide Web is a huge data repository and is growing with the explosive rate of about 1 million pages a day, web log records each access of the web page and number of entries in the web logs is increasing rapidly. These web logs, when mined properly can provide useful information for decision-making. Sequential pattern mining discovers frequent user access patterns from web logs. Since Apriori-like sequential pattern mining techniques requires expensive multiple scans of database. But, recently a novel data structure, known as Web Access Pattern Tree (or WAP-tree), was developed. This proposed method an efficient WAP-tree mining algorithm, known as DLT-mine (Doubly Linked Tree algorithm). Proposed recursive algorithm uses this doubly Linked tree to efficiently find all access patterns that satisfy user specified criteria. This mining algorithm is faster than the other Apriori-based mining algorithms.

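In short: the Web is an enormous data repository that grows by about one million pages a day, and the web logs that record every page access are growing just as quickly. Mined properly, these logs yield information that is useful for decision-making. Apriori-like sequential pattern mining needs multiple expensive scans of the database, so a new data structure, the Web Access Pattern Tree (WAP-Tree for short), was proposed. According to the paper, mining on this tree is faster than the Apriori-based algorithms.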

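While implementing WAP-Tree mining, people found that the search for frequent patterns could be optimized further. The improved structure is called the Pre-Order Linked WAP-Tree (PLWAP-Tree for short) and is described below.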

WAP-Tree

The pseudocode comes first.

Tree construction:

A. Algorithm 1 (Doubly Linked Tree Construction)

Input: A Web access sequence database WAS and a set of all possible events E.

Output: A doubly linked tree T.

Method:

Scan 1:

1. For each access sequence S of the WAS

	1.1. For each event in E

		1.1.1. For each event of an access sequence of WAS. If selected event of access sequence is equal to selected event of E then

		a. event count = event count + 1

		b. continue with the next event in E.

2. For each event in E if event qualify the threshold add that event in the set of frequent event FE.

Scan 2:

1. Create a root node for T

2. For each access sequence S in the access sequence database WAS do

	(a) Extract frequent subsequence S’ from S by removing all events appearing in S but not in FE. Let S' = s1s2….sn , where si (1≤ i ≤ n) are events in S’. Let current node is a pointer that is currently pointing to the root of T.

	(b) For i=1 to n do, if current node has a child labeled si , increase the count of si by 1 and make current node point to si , else create a new child node with label= si , count =1, parent pointer = current node and make current node point to the new node, and insert it into the si -queue

3. Return (T);

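Scan 1 counts the support of each event in the input sequences. Scan 2 removes the events whose support is below the threshold lamda, inserts the remaining frequent subsequences into a trie T, and links each newly created node labeled si to the previous node with the same label (the si-queue). The tree is returned at the end.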
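The counting and filtering steps can be illustrated in C++. This is only a minimal sketch, not the paper's code: it assumes that events are single characters and that support means the number of sequences containing an event (the same convention as AddString in the implementation at the end of this article). The helper names countSupport and frequentSubsequence are made up for illustration.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Scan 1: for every event, count the number of sequences that contain it.
// Each sequence contributes at most 1 to an event's count.
std::map<char, int> countSupport(const std::vector<std::string>& was) {
    std::map<char, int> support;
    for (const std::string& seq : was) {
        std::set<char> seen(seq.begin(), seq.end());  // distinct events of this sequence
        for (char e : seen) ++support[e];
    }
    return support;
}

// First step of Scan 2: drop every event whose support is below minSupport,
// keeping the remaining events in their original order.
std::string frequentSubsequence(const std::string& seq,
                                const std::map<char, int>& support,
                                int minSupport) {
    std::string out;
    for (char e : seq) {
        auto it = support.find(e);
        if (it != support.end() && it->second >= minSupport) out += e;
    }
    return out;
}
```

With the four sequences abdac, eaebcac, babfaec, afbacfc and a threshold of 3 (the value passed to Init in the author's main), only a, b and c survive the filter.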

For the sample database below, the tree is built from the frequent subsequences in the last column:

TID	Web access sequence	Frequent subsequence
100	abdac	abac
200	eaebcac	abcac
300	babfaec	babac
400	afbacfc	abacc
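Inserting the frequent subsequences from the table into the tree, step 2(b) of the construction pseudocode, might look like the sketch below. The type names WapNode and WapTree and the use of std::map for the children are assumptions made for this illustration; the per-event queues are kept as vectors of node pointers in creation order rather than as a linked list.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

struct WapNode {
    char label = 0;             // event stored in this node (unused for the root)
    int count = 0;              // number of sequences that pass through this node
    WapNode* parent = nullptr;  // parent pointer, followed when collecting prefixes
    std::map<char, std::unique_ptr<WapNode>> children;
};

struct WapTree {
    WapNode root;
    // e-queue: every node labeled e, in the order the nodes were created.
    std::map<char, std::vector<WapNode*>> queues;

    // Insert one frequent subsequence, sharing existing prefixes.
    void insert(const std::string& frequentSubsequence) {
        WapNode* cur = &root;
        for (char e : frequentSubsequence) {
            std::unique_ptr<WapNode>& child = cur->children[e];
            if (!child) {                      // no child labeled e yet: create it
                child.reset(new WapNode());
                child->label = e;
                child->parent = cur;
                queues[e].push_back(child.get());
            }
            ++child->count;                    // one more sequence passes through
            cur = child.get();
        }
    }
};
```

Inserting abac, abcac, babac and abacc in that order gives a root with two children, a (count 3) and b (count 1), and 13 nodes in total below the root.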

Finally, the tree is searched for frequent sequences. The pseudocode is:

B. Algorithm 2 (Mining all ξ-patterns in doubly linked tree)

Input: a Doubly linked tree T and support threshold ξ.

Output: the complete set of ξ-patterns.

Method:

1. If doubly linked tree T has only one branch, return all the unique combinations of nodes in that branch

2. Initialize Web access pattern set WAP=φ. Every event in T itself is a Web access pattern, insert them into WAP

3. For each event ei in T,

	a. Construct a conditional sequence base of ei , i.e.PS( ei ), by following the ei -queue, count conditional frequent events at the same time.

	b. If the set of conditional frequent events is not empty, build a conditional doubly linked tree for ei over PS( ei ) using algorithm 1. Recursively mine the conditional doubly linked tree

	c. For each Web access pattern returned from mining the conditional doubly linked tree, concatenate ei to it and insert it into WAP.

4. Return WAP.

This yields the following frequent patterns:

{c, aac, bac, abac, ac, abc, bc, b, ab, a, aa, ba, aba}
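The result can be cross-checked without any tree. The sketch below is a plain brute-force enumeration, not the WAP-Tree mining algorithm: it grows candidate patterns one event at a time and counts in how many frequent subsequences each candidate occurs as a (not necessarily contiguous) subsequence. The function names and the threshold of 3 are assumptions taken from the example in this article.

```cpp
#include <set>
#include <string>
#include <vector>

// True if `pattern` occurs in `seq` as a (not necessarily contiguous) subsequence.
bool isSubsequence(const std::string& pattern, const std::string& seq) {
    std::size_t matched = 0;
    for (char e : seq) {
        if (matched < pattern.size() && pattern[matched] == e) ++matched;
    }
    return matched == pattern.size();
}

// Number of sequences in `db` that contain `pattern`.
int support(const std::string& pattern, const std::vector<std::string>& db) {
    int n = 0;
    for (const std::string& seq : db) n += isSubsequence(pattern, seq) ? 1 : 0;
    return n;
}

// Extend `prefix` by every event; recurse only into frequent candidates,
// because a pattern cannot be frequent unless its prefix is frequent.
void enumerate(const std::string& prefix, const std::string& events,
               const std::vector<std::string>& db, int minSupport,
               std::set<std::string>& out) {
    for (char e : events) {
        std::string candidate = prefix + e;
        if (support(candidate, db) >= minSupport) {
            out.insert(candidate);
            enumerate(candidate, events, db, minSupport, out);
        }
    }
}
```

Calling enumerate("", "abc", db, 3, found) on the four frequent subsequences abac, abcac, babac, abacc fills found with the same 13 patterns listed above.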

PLWAP-Tree

In practice the running time of the WAP-Tree turned out to be far from ideal. Here is how the paper "PLWAP Sequential Mining: Open Source Code" introduces the PLWAP-Tree:

Abstract - PLWAP algorithm uses a preorder linked, position coded version of WAP tree and eliminates the need to recursively re-construct intermediate WAP trees during sequential mining as done by WAP tree technique. PLWAP produces significant reduction in response time achieved by the WAP algorithm and provides a position code mechanism for remembering the stored database, thus, eliminating the need to re-scan the original database as would be necessary for applications like those incrementally maintaining mined frequent patterns, performing stream or dynamic mining.

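In short: the PLWAP algorithm traverses the whole tree in pre-order to build the linked queues of the Head-Table, and it assigns every node a unique position code. From the codes alone one can tell immediately whether one node is a descendant of another, so the intermediate WAP trees no longer have to be rebuilt recursively during mining.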

For the same data, the PLWAP-Tree is constructed as shown in the figure:

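The figure should be easy to follow. A few hints to help:

1. In the figure, a label such as {c:1:1110} means that the node represents the event c, that its count is 1 (there is only one c), and that 1110 is the node's position code. The coding rules are:

① The position code of the root is empty.

② If node u has code s and its children from left to right are v1, v2, v3, ..., their codes are s1, s10, s100, ..., that is, each further child gets one more 0.

To test whether p is a descendant of q, append a "1" to the code of q and check whether the result is a prefix of the code of p. If it is, p is a descendant of q.

2. The Head-Table: in a PLWAP-Tree the linked lists are built only after the whole tree has been constructed (this differs from the WAP-Tree and is worth thinking about), and the nodes are linked in pre-order (the dashed lines in the figure). Compare them with the dashed arrows of the WAP-Tree's Head-Table and the difference is easy to see.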
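The position code rules from hint 1 fit in a few lines of C++. This is a minimal sketch that stores codes as strings of '0' and '1' characters, as the implementation below does; the function names childCode and isDescendant are hypothetical.

```cpp
#include <string>

// Code of the k-th child (k = 0 for the leftmost child) of a node whose
// code is parentCode: parentCode, then "1", then k zeros.
std::string childCode(const std::string& parentCode, std::size_t k) {
    return parentCode + "1" + std::string(k, '0');
}

// p is a descendant of q exactly when code(q) + "1" is a prefix of code(p).
// Both arguments are position codes; the root has the empty code.
bool isDescendant(const std::string& p, const std::string& q) {
    const std::string prefix = q + "1";
    return p.compare(0, prefix.size(), prefix) == 0;
}
```

For example, the second child of the node with code 11 gets the code 1110, the code of {c:1:1110} in the figure. The node 1111 is a descendant of 111, while 1110 is not: it is the sibling of 111.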

PLWAP-Tree implementation (C++)

Here is my own implementation of the PLWAP-Tree, for reference.

#include <stdio.h>
#include <cstring>
#include <string>
#include <vector>
#include <iostream>
#include <map>

#define alp_maxn 130	//upper bound on character codes (ASCII input only)

using namespace std;

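struct Node{
	char alp;			//the character (event) stored in this node
	int alp_count;			//number of sequences that pass through this node
	struct Node * nex;		//next node with the same character, in pre-order
	vector<struct Node*>son;	//children, indexed by discretized character id
	string seq;			//position code of this node
	Node(int _siz, char _alp);
};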

class PLWAPTREE{
private:
	Node * root;			//the root of the plwap-tree
	Node * Head_Table[alp_maxn];	//Head_Table: first node of each character's pre-order list
	Node * alp_las[alp_maxn];	//last node linked so far for each character
	int lamda;			//minimum support threshold
	int alp_tot;			//the number of frequent characters
	char alp_link[alp_maxn];	//discretization: id -> character
	int alp_count[alp_maxn];	//support of each character, indexed by character code
	map<char, int>alp_translate;	//discretization: character -> id
public:
	vector<string>reads;		//the input sequences
	vector<string>feq;		//the frequent patterns found
	void Init(int _lamda);
	void AddString(string st);
	void BuildTree();
	void BuildTree(Node *s, string id);
	void SearchFeq(vector<string>R, string now_feq);
	void print_tree(Node *s);	//debug only...
	Node * get_root();		//debug only...
};

Node * PLWAPTREE::get_root(){
	return root;
}

void PLWAPTREE::print_tree(Node *s){
	if (s == NULL) return;
	cout << "char : " << s->alp << " seq : " << s->seq << " alp_count : " << s->alp_count;
	if (s->nex != NULL) cout << " nex_seq :" << s->nex->seq << endl;
	else cout << endl;
	for (int i = 0; i < alp_tot; i++)
		print_tree(s->son[i]);
}

Node::Node(int _siz, char _alp = -1){
	nex = NULL;
	son.clear();
	while (_siz--) {
		son.push_back(NULL);
	}
	alp = _alp;
	alp_count = 0;
}

void PLWAPTREE::Init(int _lamda){
	root = new Node(alp_maxn);
	for (int i = 0; i < alp_maxn; i++){
		Head_Table[i] = NULL;
		alp_count[i] = 0;
		alp_las[i] = NULL;
	}
	reads.clear();
	feq.clear();
	alp_translate.clear();
	alp_tot = 0;
	lamda = _lamda;
}

void PLWAPTREE::AddString(string st){
	//each character counts at most once per sequence
	int alp_tmp[alp_maxn];
	memset(alp_tmp, 0, sizeof(alp_tmp));
	for (int i = 0; i < st.length(); i++)
		alp_tmp[(int)st[i]] = 1;
	for (int i = 0; i < alp_maxn; i++)
		alp_count[i] += alp_tmp[i];
	reads.push_back(st);
}

void PLWAPTREE::BuildTree(){
	for (int i = 0; i < alp_maxn; i++){
		if (alp_count[i] >= lamda){
			alp_link[alp_tot] = (char)i;
			alp_translate[(char)i] = alp_tot;
			alp_tot++;
		}
	}				//discretization to save memory and time
	printf("-discretization success !\n");
	for (int i = 0; i < reads.size(); i++){
		string now_string = reads[i];
		Node * pnow = root;
		for (int j = 0; j < now_string.length(); j++){
			//skip infrequent characters, insert the rest into the trie
			if (alp_count[(int)now_string[j]] < lamda) continue;
			int sig = alp_translate[now_string[j]];
			if (pnow->son[sig] == NULL){
				Node * tmp = new Node(alp_tot, now_string[j]);
				pnow->son[sig] = tmp;
			}
			pnow = pnow->son[sig];
			pnow->alp_count++;
		}
	}
	printf("-trie-build success !\n");
	BuildTree(root, "");		//assign position codes and link the Head_Table in pre-order
}

void PLWAPTREE::BuildTree(Node *s, string id){
	string seq = id + "1";		//position code of the leftmost child
	for (int i = 0; i < alp_tot; i++){
		if (s->son[i] == NULL) continue;
		if (Head_Table[i] == NULL){
			Head_Table[i] = s->son[i];
		}
		if (alp_las[i] != NULL){
			alp_las[i]->nex = s->son[i];
		}
		alp_las[i] = s->son[i];
		s->son[i]->seq = seq;
		BuildTree(s->son[i], seq);
		seq = seq + "0";	//the next sibling gets one more 0
	}
}

void PLWAPTREE::SearchFeq(vector<string>R, string now_feq){
	//R holds the position codes of the roots of the current suffix trees (empty = whole tree)
	for (int i = 0; i < alp_tot; i++){
		Node * p = Head_Table[i];
		bool flag = true;
		if (R.size() != 0){
			//find the first node of character i that lies below some root in R
			flag = false;
			while (p != NULL){
				for (int j = 0; j < R.size(); j++){
					string str = R[j] + "1";
					int sig = p->seq.find(str);
					if (sig == 0){
						flag = true;
						break;
					}
				}
				if (flag) break;
				p = p->nex;
			}
		}
		if (flag == false) continue;
		int C = p->alp_count;		//accumulated support of now_feq + character i
		string S = p->seq;		//position code of the last node counted
		vector<string>Rs; Rs.clear();	//roots of the suffix trees for the next level
		Rs.push_back(p->seq);
		for (p = p->nex; p != NULL; p = p->nex){
			bool is_son_of_R = false;
			bool is_son_of_S = false;
			if (R.size() == 0) is_son_of_R = true;
			else{
				for (int j = 0; j < R.size(); j++){
					string str = R[j] + "1";
					int sig = p->seq.find(str);
					if (sig == 0){
						is_son_of_R = true;
						break;
					}
				}
			}
			string str = S + "1";
			int sig = p->seq.find(str);
			if (sig == 0){
				is_son_of_S = true;
			}
			//count only first occurrences: below a root in R, but not below the last counted node
			if (is_son_of_R == true && is_son_of_S == false){
				C += p->alp_count;
				Rs.push_back(p->seq);
				S = p->seq;
			}
		}
		if (C >= lamda){
			feq.push_back(now_feq + alp_link[i]);
			SearchFeq(Rs, now_feq + alp_link[i]);
		}
	}
}

int main(){
	PLWAPTREE pt;
	pt.Init(3);
	printf("Init success !\n");
	pt.AddString("abdac");
	pt.AddString("eaebcac");
	pt.AddString("babfaec");
	pt.AddString("afbacfc");
	printf("read string success !\n");
	pt.BuildTree();
	printf("Build tree success !\n");
	/*
	printf("tree just like :\n");
	pt.print_tree(pt.get_root());
	*/
	vector<string>tmp; tmp.clear();
	pt.SearchFeq(tmp, "");
	printf("result : \n");
	for (int i = 0; i < pt.feq.size(); i++)
		cout << pt.feq[i] << endl;
	getchar();
	return 0;
}

References:

"Effective Web Log Mining using WAP Tree-Mine"

"PLWAP Sequential Mining: Open Source Code"

and others
