萌新笔记——C++里创建 Trie字典树（中文词典）（一）（插入、遍历）

　　萌新做词典第一篇，做得不好，还请指正，谢谢大佬！

　　写了一个词典，用到了Trie字典树。

　　写这个词典的目的，一个是为了压缩一些数据，另一个是为了尝试搜索提示，就像在谷歌搜索的时候，打出某个关键字，会提示一串可能要搜索的东西。

　　首先放上最终的结果：

input:

 编程入门

 编程软件

 编程学习

 编程学习网站

output:

 char : 件

 word : 编程软件

 char : 习

 word : 编程学习

 char : 网

 word : 编程学习网

 char : 门

 word : 编程入门

　　其实这里不应该用input，因为全是直接写死在main中进行测试的--！

　　对于这个output，char表示此时指向的字，word表示从根到指向的字的路径遍历（原谅菜鸟的我想不到恰当的词）。

　　“编程”两字因没有输入，故不会当做一个词，如果不这样设定，“编”也会被当作一个词打出来的。

　　然后来说说做这个的过程吧。

　　首先，我选择struct与unordered_map结合才表示树，具体代码如下：

 #ifndef __DICTIONARYDATA_H__

 #define __DICTIONARYDATA_H__

 #include <string>

 #include <unordered_map>

 #include <memory>

 namespace ccx{

 using std::string;

 using std::unordered_map;

 using std::shared_ptr;

 struct DictElem

 {

     string _word;//如果是根结点，则填ROOT

     int _wordId;//如果是根结点，则为总词数

     unordered_map<string, shared_ptr<DictElem> > _words;

 };

 typedef shared_ptr<DictElem> pDictElem;

 }

 #endif

　　这里的typedef是为了后面书写的方便而设的。

　　然后，是设计Dictionary类，使它具有添加一个词、添加一组词、遍历所有词的功能，当然我比较菜暂时没想怎么删除一个词的事。以下是最初的Dictionary

Dictionary.h

 #ifndef __DICTIONARY_H__

 #define __DICTIONARY_H__

 #include "DictionaryData.h"

 #include <memory>

 #include <vector>

 #include <list>

 namespace ccx{

 using std::shared_ptr;

 using std::vector;

 using std::list;

 class Dictionary

 {

     typedef unordered_map<string, pDictElem>::iterator WordIt;

     public:

         Dictionary();

         void push(const string & word);

         void push(vector<string> & words);

     private:

         void splitWord(const string & word, vector<string> & characters);//把词拆成字

         pDictElem _dictionary;

 //遍历

     public:

         void next();

         string getCurChar();

         string getCurWord();

         void resetPoint();

         bool isEnd();

     private:

         void nextWord();

         pDictElem _pcur;

         WordIt _itcur;

 //用list实现栈，遍历时方便

         list<WordIt> _stackWord;

         list<pDictElem> _stackDict;

 };

 }

 #endif

　　有了类结构，就可以去实现它了。构造函数做了一个初始化的工作：

 Dictionary::Dictionary()

 : _dictionary(new DictElem)

 {

     _dictionary->_wordId = ;

     _pcur = _dictionary;

 }

　　首先，要把一个string分解成单个的字。在做这个的时候，一开始是天真地认为中国就是占两个字节（gbk编码），总是出现迷之乱码。后来是发现我用的是utf-8编码，才解决了问题（上一篇就是讲这个）。实现代码如下：

 void Dictionary::splitWord(const string & word, vector<string> & characters)

 {

     int num = word.size();

     int i = ;

     while(i < num)

     {

         int size = ;

         if(word[i] & 0x80)

         {

             char temp = word[i];

             temp <<= ;

             do{

                 temp <<= ;

                 ++size;

             }while(temp & 0x80);

         }

         string subWord;

         subWord = word.substr(i, size);

         characters.push_back(subWord);

         i += size;

     }

 }

　　得到了单个字的集合，并存在vector中，就可以生成Trie字典数了。插入一个词的代码如下：

 void Dictionary::push(const string & word)

 {

     vector<string> characters;

     splitWord(word, characters);

     vector<string>::iterator it_char;

     it_char = characters.begin();

     pDictElem root;

     root = _dictionary;

     for(; it_char != characters.end(); ++it_char)

     {

         WordIt it_word;

         it_word = root->_words.find(*it_char);

         if(it_word == root->_words.end())

         {

             pair<string, pDictElem> temp;

             temp.first = *it_char;

             pDictElem dictemp(new DictElem);

             dictemp->_word = *it_char;

             dictemp->_wordId = ;

             temp.second = dictemp;

             root->_words.insert(temp);

             root = dictemp;

         }else{

             root = it_word->second;

         }

     }

     if(!root->_wordId)

     {

         ++(_dictionary->_wordId);

         root->_wordId = _dictionary->_wordId;

     }

 }

　　这里有用到的wordId，其实是给插入的词进行编号。在遍历的时候，如果发现编号是0，则说明并没有插入这个词，可以跳过。

　　然后是插入一组词的代码：

 void Dictionary::push(vector<string> & words)

 {

     int size = words.size();

     for(int i = ; i < size; ++i)

     {

         push(words[i]);

     }

 }

　　直接复用了插入一个词的代码，简单粗暴。

　　接着是写遍历。想到二叉树的遍历既可以用递归，又可以用栈来实现，由于递归要产生额外的开销，我就采用了栈的方法。但是，要把迭代器入栈呢，还是把智能指针入栈的问题又卡着了。由于我这里是专门写了一个next函数，遍历不是在一个函数里“一条龙”一样地完成，于是会有以下两个可能的问题（对本萌新来说）：

　　1、只有智能指针入栈，可以轻松打出一个词，但是怎么让它知道下一个应该指向谁？

　　2、如果只有迭代器入栈，又要怎么知道什么时候应该出栈（end）？对于这个问题有一个解决方案，就是每次处理的时候先出栈，然后再用此时的栈顶元素（它的父节点）来进行判断。不过觉得这样太麻烦了，于是没有这样做。

　　于是选择了两个都入栈的处理方法，结合使用，智能指针栈的栈顶元素永远是迭代器栈的父节点，并且对于root结点，迭代器栈中不存。从而有了以下代码：

 void Dictionary::nextWord()

 {

     if(_pcur)

     {

         if(_pcur->_words.size())

         {

             _stackDict.push_back(_pcur);

             _stackWord.push_back(_pcur->_words.begin());

             _pcur = _stackWord.back()->second;

         }else{

             ++(_stackWord.back());

         }

         while(_stackWord.back() == _stackDict.back()->_words.end())

         {

             _stackDict.pop_back();

             _stackWord.pop_back();

             if(!_stackDict.size())

             {

                 _pcur = NULL;

             }

             ++(_stackWord.back());

         }

         if(_pcur)

         {

             _pcur = _stackWord.back()->second;

         }

     }

 }

 void Dictionary::next()

 {

     while(_pcur)

     {

         nextWord();

         if(!_pcur || _pcur->_wordId)

         {

             break;

         }

     }

 }

　　这个next()是后来加的，因为发现如果不加这个，会把并没有输入的词也打出来，比如最开始的例子“编程软件”，会输出四个词：”编“、”编程“、”编程软“、”编程软件“，这并不是我想要结果，于是加了这么一个判断，跳过所有未输入的词。

　　当然，还要一个初始化的函数：

 void Dictionary::resetPoint()

 {

     _pcur = _dictionary;

     if(_stackDict.size())

     {

         _stackDict.clear();

     }

     if(_stackWord.size())

     {

         _stackWord.clear();

     }

     next();

 }

　　和判断是否读取完毕的函数：

 bool Dictionary::isEnd()

 {

     return _pcur == NULL;

 }

　　最后，就是获取一个词的函数了：

 string Dictionary::getCurWord()

 {

     string temp;

     list<WordIt>::iterator it_word;

     it_word = _stackWord.begin();    

     for(; it_word != _stackWord.end(); ++it_word)

     {

         temp += (*it_word)->first;

     }

     return temp;

 }

　　并且额外写了一个用于测试的，获得当前节点存的什么字的函数

 string Dictionary::getCurChar()

 {

     return _pcur->_word;

 }

　　实现完了所有函数，就开始进行测试了。我专门写了一个test.cc来使用这个类：

 #include "Dictionary.h"

 #include <iostream>

 #include <string>

 #include <vector>

 using std::cout;

 using std::endl;

 using std::string;

 using std::vector;

 int main()

 {

     ccx::Dictionary words;

     string word1 = "编程入门";

     string word2 = "编程软件";

     string word3 = "编程学习";

     string word4 = "编程学习网站";    

     words.push(word1);

     words.push(word2);

     words.push(word3);

     words.push(word4);    

     words.resetPoint();

     while(!words.isEnd())

     {

         cout << "char : " << words.getCurChar() << endl;

         cout << "word : " << words.getCurWord() << endl;

         words.next();

     }

 }

　　测试结果就在开头部分。于是，一个简单的可以插入与遍历的词典就做好了。后来又想应该给它添加查找、导入与导出的功能，想想干脆再开一篇好了。

　　0。0 感觉方法好low啊~~~~

Dictionary.cc

 #include "Dictionary.h"

 #include <iostream>

 #include <string>

 namespace ccx{

 using std::endl;

 using std::cout;

 using std::pair;

 Dictionary::Dictionary()

 : _dictionary(new DictElem)

 {

     _dictionary->_wordId = ;

     _pcur = _dictionary;

 }

 void Dictionary::splitWord(const string & word, vector<string> & characters)

 {

     int num = word.size();

     int i = ;

     while(i < num)

     {

         int size = ;

         if(word[i] & 0x80)

         {

             char temp = word[i];

             temp <<= ;

             do{

                 temp <<= ;

                 ++size;

             }while(temp & 0x80);

         }

         string subWord;

         subWord = word.substr(i, size);

         characters.push_back(subWord);

         i += size;

     }

 }

 void Dictionary::push(const string & word)

 {

     vector<string> characters;

     splitWord(word, characters);

     vector<string>::iterator it_char;

     it_char = characters.begin();

     pDictElem root;

     root = _dictionary;

     for(; it_char != characters.end(); ++it_char)

     {

         WordIt it_word;

         it_word = root->_words.find(*it_char);

         if(it_word == root->_words.end())

         {

             pair<string, pDictElem> temp;

             temp.first = *it_char;

             pDictElem dictemp(new DictElem);

             dictemp->_word = *it_char;

             dictemp->_wordId = ;

             temp.second = dictemp;

             root->_words.insert(temp);

             root = dictemp;

         }else{

             root = it_word->second;

         }

     }

     if(!root->_wordId)

     {

         ++(_dictionary->_wordId);

         root->_wordId = _dictionary->_wordId;

     }

 }

 void Dictionary::push(vector<string> & words)

 {

     int size = words.size();

     for(int i = ; i < size; ++i)

     {

         push(words[i]);

     }

 }

 //遍历用

 void Dictionary::resetPoint()

 {

     _pcur = _dictionary;

     if(_stackDict.size())

     {

         _stackDict.clear();

     }

     if(_stackWord.size())

     {

         _stackWord.clear();

     }

     next();

 }

 void Dictionary::next()

 {

     while(_pcur)

     {

         nextWord();

         if(!_pcur || _pcur->_wordId)

         {

             break;

         }

     }

 }

 void Dictionary::nextWord()

 {

     if(_pcur)

     {

         if(_pcur->_words.size())

         {

             _stackDict.push_back(_pcur);

             _stackWord.push_back(_pcur->_words.begin());

             _pcur = _stackWord.back()->second;

         }else{

             ++(_stackWord.back());

         }

         while(_stackWord.back() == _stackDict.back()->_words.end())

         {

             _stackDict.pop_back();

             _stackWord.pop_back();

             if(!_stackDict.size())

             {

                 _pcur = NULL;

             }

             ++(_stackWord.back());

         }

         if(_pcur)

         {

             _pcur = _stackWord.back()->second;

         }

     }

 }

 string Dictionary::getCurChar()

 {

     return _pcur->_word;

 }

 string Dictionary::getCurWord()

 {

     string temp;

     list<WordIt>::iterator it_word;

     it_word = _stackWord.begin();    

     for(; it_word != _stackWord.end(); ++it_word)

     {

         temp += (*it_word)->first;

     }

     return temp;

 }

 bool Dictionary::isEnd()

 {

     return _pcur == NULL;

 }

 }

萌新笔记——C++里创建 Trie字典树（中文词典）（一）（插入、遍历）的更多相关文章

萌新笔记——C++里创建 Trie字典树（中文词典）（二）（插入、查找、导入、导出）
萌新做词典第二篇,做得不好,还请指正,谢谢大佬! 做好了插入与遍历功能之后,我发现最基本的查找功能没有实现,同时还希望能够把内存的数据存入文件保存下来,并可以从文件中导入词典.此外,数据的路径是存在配 ...
萌新笔记——C++里创建 Trie字典树（中文词典）（三）（联想）
萌新做词典第三篇,做得不好,还请指正,谢谢大佬! 今天把词典的联想做好了,也是比较low的,还改了之前的查询.遍历等代码. Orz 一样地先放上运行结果: test1 ID : char : 件 w ...
C++里创建 Trie字典树（中文词典）（一）（插入、遍历）
萌新做词典第一篇,做得不好,还请指正,谢谢大佬! 写了一个词典,用到了Trie字典树. 写这个词典的目的,一个是为了压缩一些数据,另一个是为了尝试搜索提示,就像在谷歌搜索的时候,打出某个关键字,会提示 ...
C++里创建 Trie字典树（中文词典）（二）（插入、查找、导入、导出）
萌新做词典第二篇,做得不好,还请指正,谢谢大佬! 做好了插入与遍历功能之后,我发现最基本的查找功能没有实现,同时还希望能够把内存的数据存入文件保存下来,并可以从文件中导入词典.此外,数据的路径是存在配 ...
C++里创建 Trie字典树（中文词典）（三）（联想）
萌新做词典第三篇,做得不好,还请指正,谢谢大佬! 今天把词典的联想做好了,也是比较low的,还改了之前的查询.遍历等代码. Orz 一样地先放上运行结果: test1 ID : char : 件 w ...
萌新笔记——C++里将string类字符串（utf-8编码）分解成单个字（可中英混输）
最近在建词典,使用Trie字典树,需要把字符串分解成单个字.由于传入的字符串中可能包含中文或者英文,它们的字节数并不相同.一开始天真地认为中文就是两个字节,于是很happy地直接判断当前位置的字符的A ...
踹树（Trie 字典树）
Trie 字典树 ~~ 比 KMP 简单多了,无脑子选手学不会KMP,不会结论题~~ 自己懒得造图了OI WIKI 真棒字典树大概长这么个亚子呕吼真棒就是将读进去的字符串根据当前的字符是什么和所 ...
Trie字典树动态内存
Trie字典树 #include "stdio.h" #include "iostream" #include "malloc.h" #in ...
算法导论：Trie字典树
1. 概述 Trie树,又称字典树,单词查找树或者前缀树,是一种用于快速检索的多叉树结构,如英文字母的字典树是一个26叉树,数字的字典树是一个10叉树. Trie一词来自retrieve,发音为/tr ...

随机推荐

Javascript常用正则表达式汇总
一.匹配数字和字符串整数或者小数:^[0-9]+\.{0,1}[0-9]{0,2}$ 只能输入数字:"^[0-9]*$". 只能输入n位的数字:"^\d{n}$&qu ...
浅谈移动端之touch事件--手指的滑动事件
今天台风‘海马’袭击深圳,全市停工.现分享一篇关于touch的文章,望指教! 原理: 当开始一个touchstart事件的时候,获取此刻手指的横坐标startX和纵坐标startY: 当触发touch ...
OpenCV2:等间隔采样和局部均值的图像缩小
图像的缩小从物理意义上来说,就是将图像的每个像素的大小缩小相应的倍数.但是,改变像素的物理尺寸显然不是那么容易的,从数字图像处理的角度来看,图像的缩小实际就是通过减少像素个数来实现的.显而易见的,减少 ...
20 个看起来很棒的 Web UI 工具包
程序员们比设计师更需要这些 UI 方面的内容: 1. Mini Reminders Mini Reminders 2. Transluscent UI elements Transluscent UI ...
[Web API] Web API 2 深入系列(1) 路由
目录 ASP.NET 路由注册路由动态映射HttpHandler WebAPI 路由注册路由调用GetRouteData 2个路由系统衔接 GlobalConfiguration Hosted ...
ASP.NET MVC过滤器
在ASP.NET MVC中有个重要特性就是过滤器,使得我们在MVC程序开发中更好的控制浏览器请求的URL,不是每个请求都有响应内容,只有特定得用户才有.园子里关于过滤器的资料也有很多,这篇文章主要是记 ...
PPT里面添加3Dvia Composer Player 控件
本实例是Office 2013 版本 1打开文件-选项---自定义功能区--主选项卡 ---开发工具 :如图 4然后将在菜单栏里面看到开发工具 5然后将出出一个十字绘制你想要的区域 6 鼠标右击 ...
WCF入门教程（四）通过Host代码方式来承载服务
WCF入门教程(四)通过Host代码方式来承载服务之前已经讲过WCF对外发布服务的具体方式. WCF入门教程(一)简介 Host承载,可以是web,也可以是控制台程序等等.比WebService有更 ...
把NetDimension.NanUI项目从C#6.0语法还原到C#5.0
前言找Cef资料时看到一个比较好的封装NanUI for Winform发布,让Winform界面设计拥有无限可能,下载代码后发现是Vs2015+C#6.0开发的,本机没有VS2015也不想安装.于 ...
Java并发编程：线程池的使用
Java并发编程:线程池的使用在前面的文章中,我们使用线程的时候就去创建一个线程,这样实现起来非常简便,但是就会有一个问题: 如果并发的线程数量很多,并且每个线程都是执行一个时间很短的任务就结束了, ...

萌新笔记——C++里创建 Trie字典树（中文词典）（一）（插入、遍历）

萌新笔记——C++里创建 Trie字典树（中文词典）（一）（插入、遍历）的更多相关文章

随机推荐

热门专题