How to generate a new dictionary file of mmseg
How to generate a new dictionary file of mmseg
0.Usage about mmseg-node memtioned in github :
var mmseg = require("mmseg");
var q = mmseg.open('/usr/local/etc/');
console.log(q.segmentSync("我是中文分词"));
#"/usr/local/etc" is dir of mmseg's dictionary, which has a file "uni.lib" , which is the directionary file
1. so we need a generate directionary file. Before this , we need to install coreseek , ref to http://www.coreseek.cn/products-install/install_on_bsd_linux/
安装前,建议查看:源码包说明README;4.0/4.1版可参考3.2版本安装,步骤相同;如遇到问题,请看详细安装说明。
##下载coreseek:coreseek 3.2.14:点击下载、coreseek 4.0.1:点击下载、coreseek 4.1:点击下载
$ wget http://www.coreseek.cn/uploads/csft/3.2/coreseek-3.2.14.tar.gz
$ 或者 http://www.coreseek.cn/uploads/csft/4.0/coreseek-4.0.1-beta.tar.gz
$ 或者 http://www.coreseek.cn/uploads/csft/4.0/coreseek-4.1-beta.tar.gz
$ tar xzvf coreseek-3.2.14.tar.gz 或者 coreseek-4.0.1-beta.tar.gz 或者 coreseek-4.1-beta.tar.gz
$ cd coreseek-3.2.14 或者 coreseek-4.0.1-beta 或者 coreseek-4.1-beta
##前提:需提前安装操作系统基础开发库及mysql依赖库以支持mysql数据源和xml数据源
##安装mmseg
$ cd mmseg-3.2.14
$ ./bootstrap #输出的warning信息可以忽略,如果出现error则需要解决
$ ./configure --prefix=/usr/local/mmseg3
$ make && make install
$ cd ..
##安装coreseek
$ cd csft-3.2.14 或者 cd csft-4.0.1 或者 cd csft-4.1
$ sh buildconf.sh #输出的warning信息可以忽略,如果出现error则需要解决
$ ./configure --prefix=/usr/local/coreseek --without-unixodbc --with-mmseg --with-mmseg-includes=/usr/local/mmseg3/include/mmseg/ --with-mmseg-libs=/usr/local/mmseg3/lib/ --with-mysql ##如果提示mysql问题,可以查看MySQL数据源安装说明
##debian5 : ubuntu9/10 install mysql:
$ apt-get install mysql-client libmysqlclient15-dev libxml2-dev libexpat1-dev
$ make && make install
$ cd ..
##测试mmseg分词,coreseek搜索(需要预先设置好字符集为zh_CN.UTF-8,确保正确显示中文)
$ cd testpack
$ cat var/test/test.xml #此时应该正确显示中文
$ /usr/local/mmseg3/bin/mmseg -d /usr/local/mmseg3/etc var/test/test.xml #we can see content in test.xml was divided in "system-default-knowed vocabulary" which base on dictionary file "/usr/local/mmseg3/etc/unilib".
$ /usr/local/coreseek/bin/indexer -c etc/csft.conf --all #regenerate a index
2.generate a new dictionary:
#write the new vocabulary in word_new_input.txt, each vocabulary one line and cd in where you locate your word_new_input.txt
#for example (no # at the beginning of each line):
#雅阁
#马自达
# now you cd in your new vocabulary dir:
$ cd ~/projects/mmseg-3.2.14/new2
$ cat word_new_input.txt | awk '{print $1"\t""1""\nx:1"}' > word_new_gen.txt
$ cat ../data/unigram.txt | word_new_gen.txt > word_new_gen.txt
$ /usr/local/mmseg3/bin/mmseg -u word_new_gen.txt #which generate a word_new_gen.txt.lib file
$ mv word_new_gen.txt.lib uni.lib #rename
#$ cp /usr/local/mmseg3/etc ~/ -r #backup your dictionary file
$ sudo cp uni.lib /usr/local/mmseg3/etc/ #replace the dictionary file with new one
## now you cd in your coreseek-3.2.14/testpack directory
$ /usr/local/coreseek/bin/indexer -c ~/projects/coreseek-3.2.14/testpack/etc/csft.conf --all #regenerate a new index
#above generate some output as the following:
Coreseek Fulltext 3.2 [ Sphinx 0.9.9-release (r2117)]
Copyright (c) 2007-2011,
Beijing Choice Software Technologies Inc (http://www.coreseek.com)
using config file 'etc/csft.conf'...
indexing index 'xml'...
collected 3 docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total 3 docs, 7585 bytes
total 0.010 sec, 746334 bytes/sec, 295.18 docs/sec
total 2 reads, 0.000 sec, 4.2 kb/call avg, 0.0 msec/call avg
total 7 writes, 0.000 sec, 3.1 kb/call avg, 0.0 msec/call avg
#new dict store in /usr/local/mmseg3/etc/
3.test the new dictionary:
3.1 file "var/test/newtest.txt" is the one has new vocabulary sentence:
$ /usr/local/mmseg3/bin/mmseg -d /usr/local/mmseg3/etc var/test/newtest.txt
雅阁/x 现在/x 卖/x 多少/x 钱/x ?/x
马自达/x 的/x 重量/x 是/x 多少/x ?/x
3.2 or you can program in coffee:
david@Wade:~/node/node$ coffee
coffee> mmseg=require('mmseg')
{ open: [Function],
clean: [Function],
uniq: [Function] }
coffee> q= mmseg.open( '/usr/local/mmseg3/etc/')
{}
coffee> console.log q.segmentSync('我喜欢开雅阁')
[ '我', '喜欢', '开', '雅阁' ]
undefined
coffee> console.log q.segmentSync('我喜欢开丰田') #丰田 is NOT in the new dictionary
[ '我', '喜欢', '开', '丰', '田' ]
undefined
coffee> console.log q.segmentSync '我喜欢开马自达'
[ '我', '喜欢', '开', '马自达' ]
How to generate a new dictionary file of mmseg的更多相关文章
- ORA-01336: specified dictionary file cannot be opened
这篇介绍使用Logminer时遇到ORA-01336: specified dictionary file cannot be opened错误的各种场景 1:dictionary_location参 ...
- generate the call load file
#!/usr/bin/perl -w $e911_call_percent = 0.0; $ims_node_number = 12; $local_ip = "10.86.52.2&quo ...
- [转] How to generate multiple outputs from single T4 template (T4 输出多个文件)
本文转自:http://www.olegsych.com/2008/03/how-to-generate-multiple-outputs-from-single-t4-template/ Updat ...
- Java 输入/输出——File类
File类是java.io包下代表与平台无关的文件和目录,也就是说,如果希望在程序中操作文件和目录,都可以通过File类来完成.值得指出的是,不管是文件还是目录都是使用File来操作的,File能新建 ...
- How to Convert a Class File to a Java File?
What is a programming language? Before introducing compilation and decompilation, let's briefly intr ...
- GPO - File Server Management
Creating disk space usage quotas: File Screening Generate Storage Report, including file edit audit. ...
- Solr搭建大数据查询平台
参考文章:http://www.freebuf.com/articles/database/100423.html 对上面链接的补充: solr-5.5.0版本已被删除,新url:http://mir ...
- Creating a radius based VPN with support for Windows clients
This article discusses setting up up an integrated IPSec/L2TP VPN using Radius and integrating it wi ...
- Wifite.py 修正版脚本代码
Kali2.0系统自带的WiFite脚本代码中有几行错误,以下是修正后的代码: #!/usr/bin/python # -*- coding: utf-8 -*- """ ...
随机推荐
- docker图形化管理工具portainer
本章主要介绍docker的web图形化管理工具.这里使用 portainer(类似与dockui不过dockerui只支持单节点) 镜像名称 portainer/portainer 一.启动porta ...
- vs2015 产品密钥
一.破解秘钥 企业版 HM6NR-QXX7C-DFW2Y-8B82K-WTYJV 专业版 HMGNV-WCYXV-X7G9W-YCX63-B98R2 二.破解步骤 1.安装vs2015 2 ...
- linux:scp从入门到刚入门
[温馨提示] 此文和ssh配合食用更佳. 首先请小伙伴们连上你要传文件的那台机,用ssh可以免密登录. [传送文件] 我们一般发文件的话可以scp来发一发,比如说我现在要向多个扔很多tomcat包,我 ...
- word-wrap与break-word属性的区别
共同点 word-wrap:break-word与word-break:break-all都能把长单词强行断句 不同点 word-wrap:break-word会首先起一个新行来放置长单词,新的行还是 ...
- [Leetcode 44]通配符匹配Wildcard Matching
[题目] 匹配通配符*,?,DP动态规划,重点是*的两种情况 想象成两个S.P长度的字符串,P匹配S. S中不会出现通配符. [条件] (1)P=null,S=null,TRUE (2)P=null, ...
- gc图波峰波谷一直上升问题
垃圾回收曲线,波峰和波谷一直上升.正常是波峰波谷在同一水平线上,可以想象如果程序继续运行下去,老年代内存回收后也不断上升,当达到老年代满了的时候,就会报内存溢出错误. 用jmap -histo pid ...
- 学习python二三事儿(二)
多个变量赋值 Python允许你同时为多个变量赋值.例如: a = b = c = 1 以上实例,创建一个整型对象,值为1,三个变量被分配到相同的内存空间上. 您也可以为多个对象指定多个变量.例如: ...
- DevExpress ASP.NET Bootstrap Controls v18.2新功能详解(一)
行业领先的.NET界面控件2018年第二次重大更新——DevExpress v18.2日前正式发布,本站将以连载的形式为大家介绍新版本新功能.本文将介绍了DevExpress ASP.NET Boot ...
- Java基础-语法定义
Java三个体系 Java SE(Java Platform,Standard Edition).Java SE 以前称为 J2SE.它允许开发和部署在桌面.服务器.嵌入式环境和实时环境中使用的 Ja ...
- 简单理解JVM与static{}
参考如下 http://www.cnblogs.com/lao-liang/p/5110710.html http://blog.csdn.net/newjerryj/article/details/ ...