An n-gram language model file in ARPA format looks like this:

\data\
ngram 1=64000
ngram 2=522530
ngram 3=173445

\1-grams:
-5.24036 'cause -0.2084827
-4.675221 'em -0.221857
-4.989297 'n -0.05809768
-5.365303 'til -0.1855581
-2.111539 </s> 0.0
-99 <s> -0.7736475
-1.128404 <unk> -0.8049794
-2.271447 a -0.6163939
-5.174762 a's -0.03869072
-3.384722 a. -0.1877073
-5.789208 a.'s 0.0
-6.000091 aachen 0.0
-4.707208 aaron -0.2046838
-5.580914 aaron's -0.06230035
-5.789208 aarons -0.07077657
-5.881973 aaronson -0.2173971

For a detailed description, see: ARPA n-gram language model format
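As the listing above shows, each line of an `\N-grams:` section holds a log probability, N words, and an optional trailing backoff weight. The following is a minimal sketch of how one such line could be split into those parts. `ParsedNgram` and `parse_ngram_line` are hypothetical names introduced only for illustration, and words are kept as strings here rather than as vocabulary indices.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper type (not from the original code): one parsed line.
struct ParsedNgram {
    double log_prob = 0.0;           // leading number on the line
    std::vector<std::string> words;  // the N words of the n-gram
    double log_bo = 0.0;             // trailing number; 0.0 when the line has none
};

// Parse one line of an "\N-grams:" section, where `order` is N.
ParsedNgram parse_ngram_line(const std::string& line, int order) {
    assert(order >= 1);
    std::istringstream in(line);
    ParsedNgram e;
    in >> e.log_prob;
    for (int i = 0; i < order; ++i) {
        std::string w;
        in >> w;
        e.words.push_back(w);
    }
    // The backoff weight is optional; treat a missing one as 0.0.
    if (!(in >> e.log_bo)) {
        e.log_bo = 0.0;
    }
    return e;
}
```

For example, `parse_ngram_line("-5.24036 'cause -0.2084827", 1)` yields one word, `'cause`, with the two numbers stored in `log_prob` and `log_bo`.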

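An ARPA LM as a whole is made up of many n-gram entries. The data structures for a single entry and for the whole model are described in turn below.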

1. The n-gram data structure

The data structure for a single n-gram entry is:

typedef struct
{
    real log_prob ;
    real log_bo ;
    int *words ;
} ARPALMEntry ;

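words: the words that make up this n-gram. A 1-gram has a single word; for a 2-gram, words holds the indices of both words.

log_bo: the backoff weight of the n-gram, i.e. the trailing number on its line in the ARPA file (a base-10 logarithm).

log_prob: the log probability of the n-gram, i.e. the leading number on its line in the ARPA file (the base-10 logarithm of the probability of the last word given the words before it).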
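To show how the two numbers work together, here is a minimal sketch of the standard backoff lookup for a bigram: if the bigram is listed, use its log probability; otherwise add the backoff weight of the first word to the unigram log probability of the second. The map-based tables, the `Unigram` struct, and `bigram_log_prob` are hypothetical and are not how the original code stores entries; `real` is assumed to be `double`, since the source does not show its definition.

```cpp
#include <cassert>
#include <map>
#include <utility>

typedef double real;  // assumption: the source does not show how `real` is defined

// Hypothetical unigram record holding the two numbers of a 1-gram line.
struct Unigram {
    real log_prob;
    real log_bo;
};

// Return log10 P(w2 | w1) using backoff.
// `unigrams` is keyed by word id, `bigrams` by (w1, w2); both are illustrative.
real bigram_log_prob(const std::map<int, Unigram>& unigrams,
                     const std::map<std::pair<int, int>, real>& bigrams,
                     int w1, int w2) {
    auto hit = bigrams.find(std::make_pair(w1, w2));
    if (hit != bigrams.end()) {
        return hit->second;  // the bigram is listed explicitly
    }
    auto h = unigrams.find(w1);
    auto w = unigrams.find(w2);
    assert(h != unigrams.end() && w != unigrams.end());
    // Back off: weight of the history plus the lower-order log probability.
    return h->second.log_bo + w->second.log_prob;
}
```

With the unigram values from the listing above, if the bigram "a aaron" were not listed, its score would be the backoff weight of "a" plus the log probability of "aaron": -0.6163939 + (-4.707208) = -5.3236019.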

2. The ARPA-LM data structure

The data structure for the whole n-gram language model, made up of many such entries, is:

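class ARPALM
{
public:
    Vocabulary *vocab ;
    int order ;
    ARPALMEntry **entries ; // all entries of the language model, organized as an array
    int *n_ngrams ;         // one element per order (1-gram, 2-gram, 3-gram, ...): the number of entries of that order
    char *unk_wrd ;         // the vocabulary word that is not in the language model
    int unk_id ;            // ID of that word; set to the last index in the vocabulary
    int n_unk_words ;
    int *unk_words ;
private:
    bool *words_in_lm ;     // boolean array marking whether each word is in the language model
} ;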

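vocab: pointer to the vocabulary used to build the language model. For the vocabulary definition, see: In-memory storage model of the vocabulary

entries: all n-gram entries of the language model, as a two-dimensional array of ARPALMEntry. entries[0] stores the 1-grams, entries[1] stores the 2-grams, and so on.

n_ngrams: an integer array holding, in order, the number of entries among the 1-grams, 2-grams, 3-grams, and so on.

unk_wrd: the vocabulary word that is allowed to be absent from the language model.

unk_id: the ID of that word, set to the last word index in the vocabulary.

n_unk_words: counted after the language model has been read; the number of words that are in the vocabulary but were not used to build the language model. If unk_wrd is not specified, such words are not allowed, which means every word in the vocabulary must have been used to build the language model.

unk_words: stores the indices of the words counted in n_unk_words.

words_in_lm: marks, for each word in the vocabulary, whether it appears in the language model.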
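The bookkeeping described for n_unk_words, unk_words and words_in_lm can be sketched roughly as follows. This is only an illustration under stated assumptions: `UnkScan` and `scan_vocab` are hypothetical names, the vocabulary and the set of LM unigrams are passed as standard containers of strings, and the result uses `std::vector` instead of the raw arrays that ARPALM keeps.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical result type (not from the original code).
struct UnkScan {
    std::vector<bool> words_in_lm;  // one flag per vocabulary word
    std::vector<int> unk_words;     // indices of vocabulary words missing from the LM
    bool ok;                        // false if words are missing but no unk_wrd was given
};

// Compare the vocabulary against the unigrams read from the LM file.
UnkScan scan_vocab(const std::vector<std::string>& vocab,
                   const std::set<std::string>& lm_unigrams,
                   bool has_unk_wrd) {
    UnkScan r;
    r.words_in_lm.assign(vocab.size(), false);
    for (std::size_t i = 0; i < vocab.size(); ++i) {
        if (lm_unigrams.count(vocab[i]) > 0) {
            r.words_in_lm[i] = true;
        } else {
            r.unk_words.push_back(static_cast<int>(i));
        }
    }
    assert(r.unk_words.size() <= vocab.size());
    // Missing words are only acceptable when an unknown word was specified.
    r.ok = has_unk_wrd || r.unk_words.empty();
    return r;
}
```

Here `unk_words.size()` plays the role of n_unk_words, and `ok` reflects the rule that words missing from the language model are only allowed when unk_wrd has been specified.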
