Original article: http://nlp.solutions.asia/?p=180

These instructions assume Ubuntu 12.04 and Java 6 or 7 installed and JAVA_HOME configured.

Install MySQL Server and MySQL Client using the Ubuntu software center or sudo apt-get install mysql-server mysql-client at the command line.

As MySQL defaults to latin1 (are we still in the 1990s?) we need to edit /etc/mysql/my.cnf (for example with sudo vi /etc/mysql/my.cnf) and under [mysqld] add

innodb_file_format=barracuda
innodb_file_per_table=true
innodb_large_prefix=true
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
max_allowed_packet=500M

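The innodb options work around the small index key size limit in MySQL: by default InnoDB caps an index key prefix at 767 bytes, which is too short for a 767-character utf8mb4 primary key, and innodb_large_prefix raises the cap to 3072 bytes for tables using the Barracuda file format with a COMPRESSED or DYNAMIC row format. The max_allowed_packet option is so you don’t run into issues as your database and the pages you store in it get larger. Restart MySQL with sudo service mysql restart (or restart your machine) for the changes to take effect.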
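
Before restarting, it can help to confirm that every option actually landed in the file. The following is a minimal sketch, not part of the original setup: check_mycnf is a hypothetical helper name, and it only looks for the six keys listed above anywhere in the file (it does not verify that they sit under [mysqld]).

```shell
#!/bin/sh
# check_mycnf (hypothetical helper): report which of the six options from
# this guide are missing from a MySQL config file.
# Limitation: it matches keys anywhere in the file, not only under [mysqld].
check_mycnf() {
  cnf="$1"
  missing=0
  for key in innodb_file_format innodb_file_per_table innodb_large_prefix \
             character-set-server collation-server max_allowed_packet; do
    # A setting counts as present if an uncommented "key=" line exists.
    if ! grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$cnf"; then
      echo "missing: ${key}"
      missing=1
    fi
  done
  # Exit status 0 when everything is present, 1 otherwise.
  return "$missing"
}
```

For example, check_mycnf /etc/mysql/my.cnf prints nothing and returns 0 when all six options are present, and prints one "missing: ..." line per absent option otherwise.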

Check to make sure MySQL is running by typing sudo netstat -tap | grep mysql and you should see something like

tcp        0      0 localhost:mysql         *:*                     LISTEN

We need to set up the nutch database manually as the current Nutch/Gora/MySQL generated db schema defaults to latin1. Log into MySQL at the command line with the MySQL user id and password you set up earlier by typing

mysql -u xxxxx -p

then in the MySQL editor type the following:

CREATE DATABASE nutch DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_unicode_ci;

Press Enter, then type

use nutch;

Press Enter again, then copy and paste the following all together:

CREATE TABLE `webpage` (
`id` varchar(767) NOT NULL,
`headers` blob,
`text` mediumtext DEFAULT NULL,
`status` int(11) DEFAULT NULL,
`markers` blob,
`parseStatus` blob,
`modifiedTime` bigint(20) DEFAULT NULL,
`score` float DEFAULT NULL,
`typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`baseUrl` varchar(767) DEFAULT NULL,
`content` longblob,
`title` varchar(2048) DEFAULT NULL,
`reprUrl` varchar(767) DEFAULT NULL,
`fetchInterval` int(11) DEFAULT NULL,
`prevFetchTime` bigint(20) DEFAULT NULL,
`inlinks` mediumblob,
`prevSignature` blob,
`outlinks` mediumblob,
`fetchTime` bigint(20) DEFAULT NULL,
`retriesSinceFetch` int(11) DEFAULT NULL,
`protocolStatus` blob,
`signature` blob,
`metadata` blob,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
ROW_FORMAT=COMPRESSED
DEFAULT CHARSET=utf8mb4;

Then press Enter. You are done setting up the MySQL database for Nutch.

Set up Nutch 2.1 by downloading it from http://www.apache.org/dyn/closer.cgi/nutch/. Untar the contents of the file you just downloaded. Going forward we will refer to this folder as ${APACHE_NUTCH_HOME}.

From inside the nutch folder, ensure the MySQL dependency for Nutch is available by uncommenting the following dependency in ${APACHE_NUTCH_HOME}/ivy/ivy.xml

<!-- Uncomment this to use MySQL as database with SQL as Gora store. -->
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>

Edit the ${APACHE_NUTCH_HOME}/conf/gora.properties file, either deleting the Default SqlStore Properties or commenting them out using #. Then add the MySQL properties below, replacing xxxxx with the user and password you set up when installing MySQL earlier.

###############################
# MySQL properties            #
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=xxxxx
gora.sqlstore.jdbc.password=xxxxx

Edit the ${APACHE_NUTCH_HOME}/conf/gora-sql-mapping.xml file, changing the length of the primarykey from 512 to 767 in both places:
<primarykey column="id" length="767"/>

Configure ${APACHE_NUTCH_HOME}/conf/nutch-site.xml by putting a name in the value field under http.agent.name. It can be anything but cannot be left blank. Add additional languages if you want (I have added Japanese, ja-jp, below) and set utf-8 as the default encoding as well. You must specify SqlStore as the data store.

<property>
<name>http.agent.name</name>
<value>Your Nutch Spider</value>
</property>

<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the "Accept-Language" request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines build for certain national group.
</description>
</property>

<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>

<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available: ….
</description>
</property>

Install ant using the Ubuntu software center or sudo apt-get install ant at the command line.

From the command line, cd to your nutch folder and type ant runtime
This may take a few minutes to compile.

Start your first crawl by typing the lines below at the terminal (replace ‘http://nutch.apache.org/’ with whatever site you want to crawl):

cd ${APACHE_NUTCH_HOME}/runtime/local
mkdir -p urls
echo 'http://nutch.apache.org/' > urls/seed.txt
bin/nutch crawl urls -depth 3 -topN 5

You can easily add more URLs to crawl by hand in seed.txt if you want. For the crawl, depth is the number of rounds of generate/fetch/parse/update you want to do (not depth of links as you might think at first) and topN is the maximum number of URLs selected for fetching and parsing in each round. Note however that Nutch keeps track of all links it encounters in the webpage table; topN only limits how many it actually processes per round, so don’t be surprised to see many more rows in the webpage table than topN would suggest.
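
To make the meaning of depth concrete, here is a rough sketch of the steps a crawl of a given depth corresponds to. It is a dry run that only prints commands; crawl_plan is a hypothetical helper, and the subcommand names and flags (inject, generate -topN, fetch -all, parse -all, updatedb) are assumed to match the Nutch 2.x command line, so check them against the usage text that bin/nutch prints when run without arguments.

```shell
#!/bin/sh
# crawl_plan (hypothetical helper): print the commands that a crawl with the
# given depth and topN roughly corresponds to. Nothing is executed.
# Assumption: the subcommands and flags below match your Nutch 2.x install.
crawl_plan() {
  depth="$1"
  topn="$2"
  # Seed URLs are injected once, before the first round.
  echo "bin/nutch inject urls"
  round=1
  while [ "$round" -le "$depth" ]; do
    # One round = generate a batch, fetch it, parse it, update the table.
    echo "bin/nutch generate -topN ${topn}"
    echo "bin/nutch fetch -all"
    echo "bin/nutch parse -all"
    echo "bin/nutch updatedb"
    round=$((round + 1))
  done
}
```

Calling crawl_plan 3 5 prints one inject line followed by three generate/fetch/parse/updatedb rounds, which mirrors the -depth 3 -topN 5 crawl above.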

Check your crawl results by looking at the webpage table in the nutch database.

mysql -u xxxxx -p
use nutch;
SELECT * FROM nutch.webpage;

You should see the results of your crawl (around 159 rows). It will be hard to read the columns so you may want to install MySQL Workbench via sudo apt-get install mysql-workbench and use that instead for viewing the data. You may also want to run the following SQL command select * from webpage where status = 2; to limit the rows in the webpage table to only urls that were actually parsed.

Set up and index with Solr. If you are using Nutch 2.1 at this time you are on the bleeding edge and probably want the latest version of Solr 4.0 as well. Untar it to $HOME/apache-solr-4.0.0-XXXX. This folder will now be referred to as ${APACHE_SOLR_HOME}.
Download http://nlp.solutions.asia/wp-content/uploads/2012/08/schema.xml and use it to replace ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml.

From the terminal start solr:
cd ${APACHE_SOLR_HOME}/example
java -jar start.jar

You can check this is running by opening http://localhost:8983/solr in your web browser.

Leave that terminal running and from a different terminal type the following:
cd ${APACHE_NUTCH_HOME}/runtime/local/
bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex

You can now run queries in Solr against your crawled content. Open http://localhost:8983/solr/#/collection1/query and, assuming you have crawled nutch.apache.org, type text:nutch in the input box titled “q” to run a search. You should see the matching pages in the response.
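
The same search can also be run from a terminal against the standard Solr /select handler. This is a minimal sketch, assuming the default collection1 core and port 8983 from the steps above; solr_query_url is a hypothetical helper that only builds the request URL and does not contact Solr.

```shell
#!/bin/sh
# solr_query_url (hypothetical helper): build a /select URL for the
# collection1 core on localhost:8983.
#   $1 = query string, e.g. text:nutch (must not contain spaces or '&')
#   $2 = maximum number of rows to return
solr_query_url() {
  printf 'http://localhost:8983/solr/collection1/select?q=%s&wt=json&rows=%s\n' "$1" "$2"
}
```

While Solr is running, curl -s "$(solr_query_url 'text:nutch' 5)" should return up to five matching documents as JSON.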

There remains a lot to configure to get a good web search going but you are at least started.
