NLP—WordNet——词与词之间的最小距离

WordNet，是由Princeton 大学的心理学家，语言学家和计算机工程师联合设计的一种基于认知语言学的英语词典。它不是光把单词以字母顺序排列，而且按照单词的意义组成一个“单词的网络”。我们这次的任务就是求得词与词之间的最短路径，是对“图”这个数据结构再次灵活运用。

以下为SentiWordNet_3.0.0_20130122.txt文件截图：

应考虑如何存储“单词的网络”，此程序是以词作为基本单元，词与词之间的联系是通过语义。

我们简单地构造类（ListofSeg存储词的语义id）：

class C_word

{

    public string word;

    public List<int> ListofSeg = new List<int>();

    public C_word(string a)

    {

        ListofSeg = new List<int>();

        word = a;

    }

}

程序大约分成两部分：

第一部分，将“语义-词-词-...-词”格式转换成“词-语义-语义-...-语义”格式；

第二部分，构造单词网络，通过广度遍历寻找最小距离。

（运行程序的过程中发现第一部分耗时较长，担心调试第二部分时花费额外无用时间，故将第一部分结果生成文本；第二部分通过读文本来获得信息，而不是再执行一遍第一部分：），第一次将中期结果生成文本存储，觉得要这方法不错，节省大量时间！）

PART1：

using System.Collections.Generic;

using System.Text;

using System.IO;

namespace WordNet

{

    class Program

    {

        static void Main(string[] args)

        {

            List<C_word> ListofCword = new List<C_word>();

            //读取文件

            string dicpath = @"C:\Users\Lian\Desktop\SentiWordNet_3.0.0_20130122.txt";

            StreamReader sr = new StreamReader(dicpath, Encoding.Default);

            //我们用行号作为词的语义id标识，而没有使用wordnet文本中的语义id

            int sen_count = ;

            string line;

            //将所有的词以及它的语义编号存在ListofCword中

            while ((line = sr.ReadLine()) != null)

            {

                //选择需要的信息

                if(line[]!='#')

                {

                    //通过制表符'\t'分割line

                    string[] a = line.Split('\t');

                    //将我们需要的词拿出来

                    string[] b = a[].Split(' ');

                    if(b.Length>)

                    {

                        //c来存储这一行中的词

                        string[] c = new string[b.Length];

                        //去掉'#'

                        for (int i = ; i < b.Length; i++)

                        {

                            int tip = b[i].IndexOf('#');

                            c[i] = b[i].Substring(, tip);

                        }

                        //将c[]的词存入ListofCword中

                        for (int j = ; j < c.Length; j++)

                        {

                            //向ListofCword中存储第一个WORD

                            if (ListofCword.Count == )

                            {

                                C_word tempword = new C_word(c[j]);

                                ListofCword.Add(tempword);

                                ListofCword[].ListofSeg.Add(sen_count);

                            }

                            else

                            {

                                //用来判断这个词是否出现在ListofCword中

                                bool e = true;

                                //遍历整个ListofCword，查找词是否出现

                                for (int i = ; i < ListofCword.Count; i++)

                                    //若出现，则存储行号（即语义id）至这个词的ListofSeg中，并且跳出循环。

                                    if (ListofCword[i].word == c[j])

                                    {

                                        ListofCword[i].ListofSeg.Add(sen_count);

                                        e = false;

                                        break;

                                    }

                                //若这个词在整个ListofCword中没有出现，则将此词加入ListofCword中

                                if (e)

                                {

                                    C_word tempword = new C_word(c[j]);

                                    ListofCword.Add(tempword);

                                    ListofCword[ListofCword.Count - ].ListofSeg.Add(sen_count);

                                }

                            }

                        }

                    }

                }

                //行号（语义id）++

                sen_count++;

            }

            //将ListofCword存储至文件中，方便以后使用。

            string path = @"C:\Users\Lian\Desktop\wordnet.txt";

            FileStream fs = new FileStream(path, FileMode.Create);

            StreamWriter sw = new StreamWriter(fs);

            for (int i = ; i < ListofCword.Count; i++)

            {

                sw.Write(ListofCword[i].word);

                for (int j = ; j < ListofCword[i].ListofSeg.Count; j++)

                    sw.Write("\t" + ListofCword[i].ListofSeg[j]);

                sw.Write("\r\n");

            }

            //清空缓冲区

            sw.Flush();

            //关闭流

            sw.Close();

            fs.Close();

        }

    }

}

----------------------------------------------------------------------------------

PART2:

using System;

using System.Collections.Generic;

using System.Text;

using System.IO;

namespace WordNet_Next

{

    class Program

    {

        static void Main(string[] args)

        {

            List<C_word> ListofCword = new List<C_word>();

            string path = @"C:\Users\Lian\Desktop\wordnet.txt";

            StreamReader sr = new StreamReader(path, Encoding.Default);

            String line;

            while ((line = sr.ReadLine()) != null)

            {

                string[] temp = line.Split('\t');

                C_word tempword = new C_word(temp[]);

                for (int i = ; i < temp.Length; i++)

                    tempword.ListofSeg.Add(int.Parse(temp[i]));

                ListofCword.Add(tempword);

            }

            string worda="dog", wordb="robot";

            //需要查询的

            List<int> ListofSearch = new List<int>();

            //已查询过的

            List<int> ListofHaved = new List<int>();

            //临时存放的

            List<int> ListofTemp = new List<int>();

            //初始化工作

            int worda_index, wordb_index;

            for (worda_index = ; worda != ListofCword[worda_index].word; worda_index++) ;

            for (wordb_index = ; wordb != ListofCword[wordb_index].word; wordb_index++) ;

            //把worda的语义id存入ListofSearch,ListofHaved,ListofTemp中

            for (int i = ; i < ListofCword[worda_index].ListofSeg.Count; i++)

            {

                ListofSearch.Add(ListofCword[worda_index].ListofSeg[i]);

                ListofHaved.Add(ListofCword[worda_index].ListofSeg[i]);

                ListofTemp.Add(ListofCword[worda_index].ListofSeg[i]);

            }

            int distance = -;

            Boolean Searched = false;

            while(true)

            {

                distance++;

                //判断wordb中是否拥有Temp列表中的语义

                for(int apple=;apple<ListofCword[wordb_index].ListofSeg.Count;apple++)

                    for(int pear=;pear<ListofTemp.Count;pear++)

                    {

                        if(ListofCword[wordb_index].ListofSeg[apple]==ListofTemp[pear])

                        {

                            Searched = true;

                            break;

                        }

                    }

                //若有，则退出循环，准备输出Distance

                if (Searched)

                    break;

                //清空搜索列表

                ListofSearch.Clear();

                //将需要搜索的临时列表添加入搜索列表

                for (int apple = ; apple < ListofTemp.Count; apple++)

                    ListofSearch.Add(ListofTemp[apple]);

                //清空临时列表

                ListofTemp.Clear();

                //用来判断judge在整个ListofCword中，哪个词拥有在search的语义

                for (int apple = ; apple < ListofCword.Count; apple++)

                {

                    //用来判断一个词是否拥有任何一个在search的语义

                    Boolean judge = false;

                    //循环一个词的全部语义

                    for (int pear = ; pear < ListofCword[apple].ListofSeg.Count; pear++)

                    {

                        //循环要search的语义

                        for (int orange = ; orange < ListofSearch.Count; orange++)

                            //如果匹配上了，那么就直接跳出两层循环

                            if (ListofCword[apple].ListofSeg[pear] == ListofSearch[orange])

                            {

                                judge = true;

                                break;

                            }

                        if (judge)

                            break;

                    }

                    //如果这个词拥有search的语义，那么将这个词的语义存入temp队列中（若在Haved队列中，则不存储）

                    //否则，直接判断下一个词

                    if (judge)

                    {

                        //循环一个词的全部语义

                        for (int pear = ; pear < ListofCword[apple].ListofSeg.Count; pear++)

                        {

                            Boolean judge_temp = true;

                            //循环Haved中已有的语义

                            for (int orange = ; orange < ListofHaved.Count; orange++)

                            {

                                //判断这个词的某一个语义是否存在于Haved队列中，若存在，则跳出一层循环，继续判断这个词的另一语义

                                if (ListofHaved[orange] == ListofCword[apple].ListofSeg[pear])

                                {

                                    judge_temp = false;

                                    break;

                                }

                            }

                            //若这个词的这个语义不存在于Haved队列中，则存入Temp队列以及Haved队列中

                            if (judge_temp)

                            {

                                ListofTemp.Add(ListofCword[apple].ListofSeg[pear]);

                                ListofHaved.Add(ListofCword[apple].ListofSeg[pear]);

                            }

                        }

                    }

                }

                if (ListofTemp.Count == )

                {

                    Console.WriteLine("UNREACHABLE!");

                    break;

                }

            }

            Console.WriteLine(distance);

        }

    }

}

程序本身，有很多细节可以优化，有些偷懒，没有耐心去思考如何可以更快地解决词与词之间最小距离问题。

如有建议，请尽情指点！

PS：第二部分一气呵成，没有调试一次。。。一次成功，感到惊奇，故留念一下：）

小LiAn

2017/4/15夜

NLP—WordNet——词与词之间的最小距离的更多相关文章

Deep Learning in NLP （一）词向量和语言模型
原文转载:http://licstar.net/archives/328 Deep Learning 算法已经在图像和音频领域取得了惊人的成果,但是在 NLP 领域中尚未见到如此激动人心的结果.关于这 ...
Word2Vec之Deep Learning in NLP （一）词向量和语言模型
转自licstar,真心觉得不错,可惜自己有些东西没有看懂这篇博客是我看了半年的论文后,自己对 Deep Learning 在 NLP 领域中应用的理解和总结,在此分享.其中必然有局限性,欢迎各种交 ...
【中文同义词近义词】词向量 vs 同义词近义词库
方案一:利用预训练好的词向量模型优点: (1)能把词进行语义上的向量化(2)能得到词与词的相似度缺点: (1)词向量的效果和语料库的大小和质量有较大的关系(2)用most_similar() 得到 ...
斯坦福深度学习与nlp第四讲词窗口分类和神经网络
http://www.52nlp.cn/%E6%96%AF%E5%9D%A6%E7%A6%8F%E6%B7%B1%E5%BA%A6%E5%AD%A6%E4%B9%A0%E4%B8%8Enlp%E7%A ...
R+NLP︱text2vec包——BOW词袋模型做监督式情感标注案例（二,情感标注）
要学的东西太多,无笔记不能学~~ 欢迎关注公众号,一起分享学习笔记,记录每一颗"贝壳"~ --------------------------- 在之前的开篇提到了text2vec ...
词向量词嵌入 word embedding
词嵌入 word embedding embedding 嵌入 embedding: 嵌入, 在数学上表示一个映射f:x->y, 是将x所在的空间映射到y所在空间上去,并且在x空间中每一个x有y ...
[超详细] Python3爬取豆瓣影评、去停用词、词云图、评论关键词绘图处理
爬取豆瓣电影<大侦探皮卡丘>的影评,并做词云图和关键词绘图第一步:找到评论的网页url.https://movie.douban.com/subject/26835471/comments ...
(hdu 7.1.8)Quoit Design(最低点——在n一个点,发现两点之间的最小距离)
主题: Quoit Design Time Limit: 10000/5000 MS (Java/Others) Memory Limit: 65536/32768 K (Java/Others) T ...
NLP︱句子级、词语级以及句子-词语之间相似性（相关名称：文档特征、词特征、词权重）
每每以为攀得众山小,可.每每又切实来到起点,大牛们,缓缓脚步来俺笔记葩分享一下吧,please~ --------------------------- 关于相似性以及文档特征.词特征有太多种说法.弄 ...

随机推荐

（5）可变、不可变和hash函数
分类情况与列表相似,列表用[],元组是()表示内存角度看列表与数字的变与不变列表 >>>l = [1,2,3,4] >>>id(l) 4392665160 & ...
signal()信号操作
一.函数描述 #include <signal.h> typedef void (*sighandler_t)(int);sighandler_t signal(int signum, s ...
graphql-modules 企业级别的graphql server 工具
graphql-modules 是一个新开源的graphql 工具,是基于apollo server 2.0 的扩展库,该团队认为开发应该是模块化的. 几张来自官方团队的架构图可以参考,方便比较 a ...
解决 php提交表单到当前页面，刷新会重复提交的问题
http://blog.csdn.net/u012466451/article/details/68952280
Mac 上 java 究竟在哪里，本文彻底让你搞清楚！
Mac下当你在[终端]输入java -version时,是执行的哪里的java呢,which java命令可以看到,就是[/usr/bin/java] [/usr/bin/java]只是个替身,实际指 ...
win10激活命令
以管理员方式打开命令提示符输入以下3条命令slmgr /ipk W269N-WFGWX-YVC9B-4J6C9-T83GX 按回车slmgr /skms 54.223.212.31 按回车slmgr ...
Java JDBC连接Oracle
1. 安装Oracle数据库,我这里使用的是Oracle 12c 2. 创建Java工程 connection-oracle 注意:使用的JavaSE-1.8 3. 在Oracle的安装目录里,将dj ...
golang sizeof 占用空间大小
C语言中,可以使用sizeof()计算变量或类型占用的内存大小.在Go语言中,也提供了类似的功能, 不过只能查看变量占用空间大小.具体使用举例如下. package main import ( &qu ...
golang kafka client
针对golang的 kafka client 有很多开源package,例如sarama, confluent等等.在使用sarama 包时,高并发中偶尔遇到crash.于是改用confluent-k ...
WPF Demo6
通知项熟悉.数据绑定 using System.ComponentModel; namespace Demo6 { /// <summary> /// 通知项属性 /// </sum ...

NLP—WordNet——词与词之间的最小距离

NLP—WordNet——词与词之间的最小距离的更多相关文章

随机推荐

热门专题