https://github.com/yandex/faster-rnnlm

 
 

Gdb ./rnnlm

r -rnnlm model-good.faster -train thread.title.good.train.txt -valid thread.title.good.valid.txt -hidden 5- -direct-order 3 -direct 200 -bptt 4 -bptt-block 10 -threads 1

[root@cq01-forum-rstree01.cq01.baidu.com faster-rnnlm]# more thread.title.good.train.txt

唉        稳        凉菜        干货        批发        稳        左成        个        月        都

咦        丢        图        跑

毕竟        新人

我        想去旅行

昨天        玩        个        满        深渊        人马                 才        踩        了        55

这        状态        还        不如        温网

新型        投资项目

晒        早饭        就        酱

渣土        哥        真是        太        放肆        了

推荐        就是        有        这样        的

白素贞        水        漫        文水        城

我知道        那些夏天        就像        你        一样        回        不

渑池        至        洛阳        最早        的        车        几        点        哪里        坐        到        洛阳        几点

宏观        方面        大        的        流动性        格局        虽无        明显        变化        但        眼下        地方        政府        债务        限

电工        行业        竞争        大        锦力        电器        有        优势

兄弟        啊                 影技        1        班        q        群        是        多少

你们        家乡        话        叫        什么

深深        的        孤独感        与        挫败        感        感觉        个人

一起去        旅游        吧

谁知道        四会        那里        有        修        打火机        的

[root@cq01-forum-rstree01.cq01.baidu.com faster-rnnlm]# pwd

/home/users/chenghuige/other/faster-rnnlm.debug/faster-rnnlm

  1. 统计词频建立vocabulary

void Vocabulary::BuildFromCorpus(const std::string& fpath, bool show_progress)

首先添加一个 </s>

AddWord(kEOSTag);
只是编号0

 
 

然后逐个添加每行

每行处理的时候按照IsSpace切分

inline bool IsSpace(char c) {

return c == ' ' || c == '\r' || c == '\t' || c == '\n';

 
 

然后其实就是对每个词
类似 Identifer.h那样顺序编号,没出现的词
叫做oov 编号 -1

 
 

除了编号之外
同时统计频次

最后按照频次排序
从大到小
同时更新编号
也就是频次最大的
这里 </s> 编号为0

(gdb) p words_

, word = 0x6ae1c0 "</s>"}, {

freq = 258246, word = 0x6aef20 "\265\304"}, {freq = 126910, word = 0x6aeff0 "\301\313"}, {

freq = 101904, word = 0x6aedc0 "\316\322"}, {freq = 67328, word = 0x6aeee0 "\323\320"}, {

freq = 62290, word = 0x6aec10 "\270\366"}, {freq = 60866, word = 0x6afb20 "\322\273"}, {

 
 

[root@cq01-forum-rstree01.cq01.baidu.com faster-rnnlm]# wc -l thread.title.good.train.txt

thread.title.good.train.txt

 
 

gdb) p cfg

$2 = {layer_size = 5, layer_count = 1, maxent_hash_size = 199947228, maxent_order = 3, use_nce = false, nce_lnz = 9, reverse_sentence = false, layer_type = "sigmoid"}

 
 

  1. 构建网格结构

main_nnet = new NNet(vocab, cfg, use_cuda, use_cuda_memory_efficient);

构造函数调用Init 在这里

 
 

embeddings.resize(vocab.size(), cfg.layer_size);

//(word_num, hidden_size) 二维数组

 
 

rec_layer = CreateLayer(cfg.layer_type, cfg.layer_size, cfg.layer_count);

//隐层
建立一个layer 默认layer_type是sigmoid

 
 

maxent_layer.Init(cfg.maxent_hash_size);

//最大熵
@TODO

 
 

softmax_layer = HSTree::CreateHuffmanTree(vocab, cfg.layer_size);

//输出层 softmax 采用huffman树

 
 

 
 

Faster-rnnlm代码分析1 - 词表构建,Nnet成员的更多相关文章

  1. tensorflow faster rcnn 代码分析一 demo.py

    os.environ["CUDA_VISIBLE_DEVICES"]=2 # 设置使用的GPU tfconfig=tf.ConfigProto(allow_soft_placeme ...

  2. 完整全面的Java资源库(包括构建、操作、代码分析、编译器、数据库、社区等等)

    构建 这里搜集了用来构建应用程序的工具. Apache Maven:Maven使用声明进行构建并进行依赖管理,偏向于使用约定而不是配置进行构建.Maven优于Apache Ant.后者采用了一种过程化 ...

  3. tensorflow笔记:多层LSTM代码分析

    tensorflow笔记:多层LSTM代码分析 标签(空格分隔): tensorflow笔记 tensorflow笔记系列: (一) tensorflow笔记:流程,概念和简单代码注释 (二) ten ...

  4. Android代码分析工具lint学习

    1 lint简介 1.1 概述 lint是随Android SDK自带的一个静态代码分析工具.它用来对Android工程的源文件进行检查,找出在正确性.安全.性能.可使用性.可访问性及国际化等方面可能 ...

  5. 常用 Java 静态代码分析工具的分析与比较

    常用 Java 静态代码分析工具的分析与比较 简介: 本文首先介绍了静态代码分析的基 本概念及主要技术,随后分别介绍了现有 4 种主流 Java 静态代码分析工具 (Checkstyle,FindBu ...

  6. angular代码分析之异常日志设计

    angular代码分析之异常日志设计 错误异常是面向对象开发中的记录提示程序执行问题的一种重要机制,在程序执行发生问题的条件下,异常会在中断程序执行,同时会沿着代码的执行路径一步一步的向上抛出异常,最 ...

  7. [Asp.net 5] DependencyInjection项目代码分析4-微软的实现(3)

    这个系列已经写了5篇,链接地址如下: [Asp.net 5] DependencyInjection项目代码分析 [Asp.net 5] DependencyInjection项目代码分析2-Auto ...

  8. wifi display代码 分析

    转自:http://blog.csdn.net/lilian0118/article/details/23168531 这一章中我们来看Wifi Display连接过程的建立,包含P2P的部分和RTS ...

  9. Device Tree(三):代码分析【转】

    转自:http://www.wowotech.net/linux_kenrel/dt-code-analysis.html Device Tree(三):代码分析 作者:linuxer 发布于:201 ...

随机推荐

  1. K-V-O 键值观察机制

    在两个不同的控制器之间传值是iOS开发中常有的情况,应对这种情况呢,有多种的应对办法.kvc就是其中的一种,所以,我们就在此解释之.   key value observing  键值观察,给人一种高 ...

  2. Homework

    #include<stdio.h> #include<math.h> int main() { int a,b,c,l,p,s; printf("请输入三个数:&qu ...

  3. caffe学习系列(3):数据层介绍

    一个模型由多个层构成,如Data,conv,pool等.其中数据层是模型的最底层,是模型的入口. 提供数据的输入,也提供数据从Blobs转换成别的格式进行保存输出还包括数据的预处理(如减去 均值, 放 ...

  4. HDU——PKU题目分类

    HDU 模拟题, 枚举1002 1004 1013 1015 1017 1020 1022 1029 1031 1033 1034 1035 1036 1037 1039 1042 1047 1048 ...

  5. BZOJ 4455: [Zjoi2016]小星星

    Sol 容斥原理+树形DP. 这道题用的容斥思想非常妙啊!主要的思路就是让所有点与S集合中的点对应,可以重复对应,并且可以不用对应完全(意思是是S的子集也可以).这样他有未对应完全的,那就减去,从全都 ...

  6. mysql中like用法

    like 的通配符有两种 %(百分号):代表零个.一个或者多个字符. _(下划线):代表一个数字或者字符. 1. name以"李"开头 where name like '李%' 2 ...

  7. thinkphp save()方法没有数据,保存失败解决办法

    thinkphp save()方法没有数据保存返回0,保存失败返回false   可以对返回值判断一下就好 $ret = $model->save($data); //var_dump($ret ...

  8. zend studio安装svn插件

    进入zend->Help->Install New Software 输入如下地址: http://subclipse.tigris.org/update_1.8.x 选择subclips ...

  9. 【GoLang】golang 微服务框架 介绍

    原文如下: rpcx是一个类似阿里巴巴 Dubbo 和微博 Motan 的分布式的RPC服务框架,基于Golang net/rpc实现. 谈起分布式的RPC框架,比较出名的是阿里巴巴的dubbo,包括 ...

  10. string转DateTime(时间格式转换)

    1.不知道为什么时间在数据库用varchar(8)来保存,例如"19900505",但是这样的保存格式在处理时间的时候是非常不方便的. 但是转换不能用Convert.ToDateT ...