We should start with the WAP-Tree. The following passage is quoted from the paper "Effective Web Log Mining using WAP Tree-Mine":
page and number of entries in the web logs is increasing rapidly. These web logs,when mined properly can provide useful information for decision-making. Sequential pattern mining discovers frequent user access patterns from web logs. Since Apriori-like sequential
pattern mining techniques requires expensive multiple scans of database. But, recently a novel data structure, known as Web Access Pattern Tree (or WAP-tree), was developed. This proposed method an efficient WAP-tree mining algorithm,known as DLT-mine (Doubly
Linked Tree algorithm). Proposed recursive algorithm uses this doubly Linked tree to efficiently find all access patterns that satisfy user specified criteria. This mining algorithm is faster than the other Apriori-based mining algorithms.
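Later, while implementing the WAP-Tree algorithm, people found that its search for frequent items could be optimized further. The improved structure is the "Pre-Order Linked WAP-Tree" (PLWAP-Tree for short), which is described in detail below.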
A. Algorithm 1 (Doubly Linked Tree Construction)
Input: A Web access sequence database WAS and a set of all possible events E.
Output: A doubly linked tree T.
Scan 1:
1. For each access sequence S of the WAS
1.1. For each event in E
1.1.1. For each event of an access sequence of WAS. If selected event of access sequence is equal to selected event of E then
a. event count = event count + 1
b. continue with the next event in E.
2. For each event in E if event qualify the threshold add that event in the set of frequent event FE.
Scan 2:
1. Create a root node for T
2. For each access sequence S in the access sequence database WAS do
(a) Extract frequent subsequence S’ from S by removing all events appearing in S but not in FE. Let S' = s1s2….sn , where si (1≤ i ≤ n) are events in S’. Let current node is a pointer that is currently pointing to the root of T.
(b) For i=1 to n do, if current node has a child labeled si , increase the count of si by 1 and make current node point to si , else create a new child node with label= si , count =1, parent pointer = current node and make current node point to the new node, and insert it into the si -queue
3. Return (T);
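Step 2(b) of Scan 2 can be sketched in C++ as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `WapNode`, `WapTree`, `insert` and `queues` are hypothetical, and each event queue is kept as a plain vector of node pointers in creation order.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node type for illustration; field names are assumptions.
struct WapNode {
    char label = 0;              // event stored in this node
    int count = 0;               // number of sequences sharing this prefix
    WapNode* parent = nullptr;   // parent pointer, as in step 2(b)
    std::map<char, std::unique_ptr<WapNode>> children;
};

struct WapTree {
    WapNode root;
    // One queue per event: every node carrying that label, in creation order.
    std::map<char, std::vector<WapNode*>> queues;

    // Insert one frequent subsequence S' = s1 s2 ... sn.
    void insert(const std::string& s) {
        WapNode* cur = &root;
        for (char e : s) {
            auto it = cur->children.find(e);
            if (it == cur->children.end()) {
                // No child labeled e: create it and append it to the e-queue.
                std::unique_ptr<WapNode> node(new WapNode());
                node->label = e;
                node->parent = cur;
                WapNode* raw = node.get();
                cur->children.emplace(e, std::move(node));
                queues[e].push_back(raw);
                cur = raw;
            } else {
                cur = it->second.get();
            }
            ++cur->count;  // a new node starts at 0, so it ends up with count 1
        }
    }
};
```

Inserting the four frequent subsequences of the example (abac, abcac, babac, abacc) yields a tree whose root has two children, `a` with count 3 and `b` with count 1.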
TID | Web access sequence | Frequent subsequence |
100 | abdac | abac |
200 | eaebcac | abcac |
300 | babfaec | babac |
400 | afbacfc | abacc |
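The "Frequent subsequence" column of the table can be reproduced with a short sketch of Scan 1 followed by the filtering in step 2(a). It assumes a minimum support count of 3 out of 4 sequences, which is the value the C++ listing in this article passes to `Init`; the function names `frequent_events` and `frequent_subsequence` are hypothetical.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Scan 1: for each event, count the sequences that contain it (once per
// sequence) and keep the events whose count reaches min_support.
std::set<char> frequent_events(const std::vector<std::string>& was,
                               int min_support) {
    std::map<char, int> support;
    for (const std::string& s : was) {
        std::set<char> seen(s.begin(), s.end());  // each event once per sequence
        for (char e : seen) ++support[e];
    }
    std::set<char> fe;
    for (const auto& kv : support)
        if (kv.second >= min_support) fe.insert(kv.first);
    return fe;
}

// Step 2(a): drop every event of s that is not in FE, keeping the order.
std::string frequent_subsequence(const std::string& s,
                                 const std::set<char>& fe) {
    std::string out;
    for (char e : s)
        if (fe.count(e)) out += e;
    return out;
}
```

With the four sequences of the table and a threshold of 3, only `a`, `b` and `c` survive (`d` occurs in one sequence, `e` and `f` in two each), so for instance `eaebcac` is reduced to `abcac`.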
B. Algorithm 2 (Mining all ξ-patterns in doubly linked tree)
Input: a Doubly linked tree T and support threshold ξ.
Output: the complete set of ξ-patterns.
1. If doubly linked tree T has only one branch, return all the unique combinations of nodes in that branch
2. Initialize Web access pattern set WAP=φ. Every event in T itself is a Web access pattern, insert them into WAP
3. For each event ei in T,
a. Construct a conditional sequence base of ei , i.e.PS( ei ), by following the ei -queue, count conditional frequent events at the same time.
b. If the set of conditional frequent events is not empty, build a conditional doubly linked tree for ei over PS( ei ) using algorithm 1. Recursively mine the conditional doubly linked tree
c. For each Web access pattern returned from mining the conditional doubly linked tree, concatenate ei to it and insert it into WAP.
4. Return WAP.
{c, aac, bac, abac, ac, abc, bc, b, ab, a, aa, ba, aba}
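The pattern set above can be cross-checked by brute force. The sketch below is not the doubly linked tree algorithm: it simply grows candidate patterns one event at a time and counts, for each candidate, how many sequences contain it as a subsequence. It assumes the set refers to the four sequences of the table with a minimum support count of 3; `is_subsequence`, `support` and `frequent_patterns` are hypothetical names.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// True if p can be obtained from s by deleting events (order preserved).
bool is_subsequence(const std::string& p, const std::string& s) {
    std::size_t i = 0;
    for (char c : s)
        if (i < p.size() && p[i] == c) ++i;
    return i == p.size();
}

// Number of sequences in db that contain p.
int support(const std::string& p, const std::vector<std::string>& db) {
    int n = 0;
    for (const std::string& s : db)
        if (is_subsequence(p, s)) ++n;
    return n;
}

// Extend only frequent prefixes: an infrequent pattern cannot have a
// frequent extension, so the recursion stops as soon as support drops.
void grow(const std::string& prefix, const std::string& events,
          const std::vector<std::string>& db, int min_support,
          std::set<std::string>& out) {
    for (char e : events) {
        std::string cand = prefix + e;
        if (support(cand, db) >= min_support) {
            out.insert(cand);
            grow(cand, events, db, min_support, out);
        }
    }
}

std::set<std::string> frequent_patterns(const std::vector<std::string>& db,
                                        const std::string& events,
                                        int min_support) {
    std::set<std::string> out;
    grow("", events, db, min_support, out);
    return out;
}
```

Calling `frequent_patterns` on the four sequences with events `"abc"` and threshold 3 gives the same 13 patterns as the set listed above. The approach rescans the database for every candidate, which is exactly the cost the tree-based methods are designed to avoid.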
In practice, people found that the time complexity of the WAP-Tree was not ideal. See the introduction to the PLWAP-Tree in the paper "PLWAP Sequential Mining: Open Source Code":
during sequential mining as done by WAP tree technique. PLWAP produces significant reduction in response time achieved by the WAP algorithm and provides a position code mechanism for remembering the stored database, thus, eliminating the need to re-scan the
original database as would be necessary for applications like those incrementally maintaining mined frequent patterns, performing stream or dynamic mining.
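The position code mechanism mentioned in the quote can be sketched as follows. The sketch follows the convention used by the `BuildTree` and `SearchFeq` routines of this article's C++ listing: the root has the empty code, the leftmost child appends "1" to its parent's code, and every further sibling appends "0" to the previous sibling's code. The helper names are hypothetical.

```cpp
#include <string>

// Code of the leftmost child of a node.
std::string first_child_code(const std::string& parent) {
    return parent + "1";
}

// Code of the sibling immediately to the right of a node.
std::string next_sibling_code(const std::string& sibling) {
    return sibling + "0";
}

// Node a is an ancestor of node b exactly when code(a) + "1" is a prefix
// of code(b). No tree traversal is needed, only a string comparison.
bool is_ancestor(const std::string& a, const std::string& b) {
    const std::string prefix = a + "1";
    return b.compare(0, prefix.size(), prefix) == 0;
}
```

For example, a node with code `1` is an ancestor of the node with code `111`, but not of its own right sibling `10`. This test is what lets the mining step decide whether a node in an event queue belongs to the current suffix tree without walking parent pointers.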
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>
#include <iostream>
#include <map>

#define alp_maxn 130                 // size of the raw character alphabet

using namespace std;

struct Node{
    char alp;                        // event label stored in this node
    int alp_count;                   // number of sequences passing through this node
    struct Node * nex;               // next node with the same label, in pre-order
    vector<struct Node*>son;         // children, indexed by discretized event id
    string seq;                      // binary position code of this node
    Node(int _siz, char _alp = -1);
};

class PLWAPTREE{
    Node * root;                     //the root of the plwap-tree
    Node * Head_Table[alp_maxn];     //Head_Table: first node of each event queue
    Node * alp_las[alp_maxn];        //last node linked into each event queue
    int lamda;                       //minimum support count
    int alp_tot;                     //the number of frequent events
    char alp_link[alp_maxn];         //discretization: id -> event
    int alp_count[alp_maxn];         //support count of every raw character
    map<char, int>alp_translate;     //discretization: event -> id

public:
    vector<string>reads;             //the input sequences
    vector<string>feq;               //the frequent patterns found

    void Init(int _lamda);
    void AddString(string st);
    void BuildTree();
    void BuildTree(Node *s, string id);
    void SearchFeq(vector<string>R, string now_feq);

    void print_tree(Node *s);        //debug only...
    Node * get_root();               //debug only...
};

Node * PLWAPTREE::get_root(){
    return root;
}

void PLWAPTREE::print_tree(Node *s){
    if (s == NULL) return;
    cout << "char : " << s->alp << " seq : " << s->seq << " alp_count : " << s->alp_count;
    if (s->nex != NULL) cout << " nex_seq :" << s->nex->seq << endl;
    else cout << endl;
    for (int i = 0; i < alp_tot; i++)
        print_tree(s->son[i]);
}

Node::Node(int _siz, char _alp){
    nex = NULL;
    while (_siz--) {
        son.push_back(NULL);         // one empty child slot per possible event
    }
    alp = _alp;
    alp_count = 0;
}

void PLWAPTREE::Init(int _lamda){
    root = new Node(alp_maxn);
    for (int i = 0; i < alp_maxn; i++){
        Head_Table[i] = NULL;
        alp_count[i] = 0;
        alp_las[i] = NULL;
    }
    alp_tot = 0;
    lamda = _lamda;
}

// First scan: count each event once per sequence and remember the sequence.
void PLWAPTREE::AddString(string st){
    int alp_tmp[alp_maxn];
    memset(alp_tmp, 0, sizeof(alp_tmp));
    for (size_t i = 0; i < st.length(); i++)
        alp_tmp[(int)st[i]] = 1;
    for (int i = 0; i < alp_maxn; i++)
        alp_count[i] += alp_tmp[i];
    reads.push_back(st);             // kept for the second scan in BuildTree()
}

// Second scan: insert the frequent subsequence of every input into a trie.
void PLWAPTREE::BuildTree(){
    for (int i = 0; i < alp_maxn; i++){
        if (alp_count[i] >= lamda){
            alp_link[alp_tot] = (char)i;
            alp_translate[(char)i] = alp_tot;
            alp_tot++;
        }
    }
    //discretization to save memory and time
    printf("-discretization success !\n");

    for (size_t i = 0; i < reads.size(); i++){
        string now_string = reads[i];
        Node * pnow = root;
        for (size_t j = 0; j < now_string.length(); j++){
            if (alp_count[(int)now_string[j]] < lamda) continue;   // skip infrequent events
            int sig = alp_translate[now_string[j]];
            if (pnow->son[sig] == NULL){
                Node * tmp = new Node(alp_tot, now_string[j]);
                pnow->son[sig] = tmp;
            }
            pnow = pnow->son[sig];
            pnow->alp_count++;
        }
    }
    printf("-trie-build success !\n");

    BuildTree(root, "");
}

// Pre-order traversal: assign position codes and link the event queues.
// The leftmost child gets id + "1"; each later sibling appends one more "0".
void PLWAPTREE::BuildTree(Node *s, string id){
    string seq = id + "1";
    for (int i = 0; i < alp_tot; i++){
        if (s->son[i] == NULL) continue;
        if (Head_Table[i] == NULL){
            Head_Table[i] = s->son[i];
        }
        if (alp_las[i] != NULL){
            alp_las[i]->nex = s->son[i];
        }
        alp_las[i] = s->son[i];
        s->son[i]->seq = seq;
        BuildTree(s->son[i], seq);
        seq = seq + "0";
    }
}

// R holds the position codes of the roots of the current suffix trees
// (empty on the first call, meaning the whole tree); now_feq is the
// frequent prefix found so far. A node is a descendant of a root r when
// its position code starts with r + "1".
void PLWAPTREE::SearchFeq(vector<string>R, string now_feq){
    for (int i = 0; i < alp_tot; i++){
        // find the first node of event i that lies inside a suffix tree
        Node * p = Head_Table[i];
        bool flag = true;
        if (R.size() != 0){
            flag = false;
            while (p != NULL){
                for (size_t j = 0; j < R.size(); j++){
                    string str = R[j] + "1";
                    int sig = p->seq.find(str);
                    if (sig == 0){
                        flag = true;
                        break;
                    }
                }
                if (flag) break;
                p = p->nex;
            }
        }
        if (flag == false || p == NULL) continue;

        int C = p->alp_count;        // accumulated support of now_feq + event i
        string S = p->seq;           // code of the last node that was counted
        vector<string>Rs; Rs.clear();
        Rs.push_back(p->seq);        // roots of the next suffix trees

        for (p = p->nex; p != NULL; p = p->nex){
            bool is_son_of_R = false;
            bool is_son_of_S = false;
            if (R.size() == 0) is_son_of_R = true;
            for (size_t j = 0; j < R.size(); j++){
                string str = R[j] + "1";
                int sig = p->seq.find(str);
                if (sig == 0){
                    is_son_of_R = true;
                }
            }
            string str = S + "1";
            int sig = p->seq.find(str);
            if (sig == 0){
                is_son_of_S = true;
            }
            // count only the first occurrence of the event on each branch
            if (is_son_of_R == true && is_son_of_S == false){
                C += p->alp_count;
                S = p->seq;
                Rs.push_back(p->seq);
            }
        }

        if (C >= lamda){
            feq.push_back(now_feq + alp_link[i]);
            SearchFeq(Rs, now_feq + alp_link[i]);
        }
    }
}

int main(){
    PLWAPTREE pt;
    pt.Init(3);
    printf("Init success !\n");

    // the four access sequences from the table above
    pt.AddString("abdac");
    pt.AddString("eaebcac");
    pt.AddString("babfaec");
    pt.AddString("afbacfc");
    printf("read string success !\n");

    pt.BuildTree();
    printf("Build tree success !\n");

    /*
    printf("tree just like :\n");
    pt.print_tree(pt.get_root());
    */

    vector<string>tmp; tmp.clear();
    pt.SearchFeq(tmp, "");

    printf("result : \n");
    for (size_t i = 0; i < pt.feq.size(); i++)
        cout << pt.feq[i] << endl;

    getchar();                       // wait for a key press before exiting
    return 0;
}
