【Nutch2.2.1基础教程之2.2】集成Nutch/Hbase/Solr构建搜索引擎之二：内容分析

请先参见“集成Nutch/Hbase/Solr构建搜索引擎之一：安装及运行”，搭建测试环境

http://blog.csdn.net/jediael_lu/article/details/37329731

一、被索引的域 Schema.xml

1、文档基本内容

在使用solr对Nutch抓取到的网页进行索引时，schema.xml被替换成以下内容。

文件中指定了哪些域被索引、存储等内容。

<?xml version="1.0" encoding="UTF-8" ?>

   

<schema name="nutch" version="1.5">

    <types>

        <fieldType name="string" class="solr.StrField" sortMissingLast="true"

            omitNorms="true"/> 

        <fieldType name="long" class="solr.TrieLongField" precisionStep="0"

            omitNorms="true" positionIncrementGap="0"/>

        <fieldType name="float" class="solr.TrieFloatField" precisionStep="0"

            omitNorms="true" positionIncrementGap="0"/>

        <fieldType name="date" class="solr.TrieDateField" precisionStep="0"

            omitNorms="true" positionIncrementGap="0"/>

        <fieldType name="text" class="solr.TextField"

            positionIncrementGap="100">

            <analyzer>

                <tokenizer class="solr.WhitespaceTokenizerFactory"/>

                <filter class="solr.StopFilterFactory"

                    ignoreCase="true" words="stopwords.txt"/>

                <filter class="solr.WordDelimiterFilterFactory"

                    generateWordParts="1" generateNumberParts="1"

                    catenateWords="1" catenateNumbers="1" catenateAll="0"

                    splitOnCaseChange="1"/>

                <filter class="solr.LowerCaseFilterFactory"/>

                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

            </analyzer>

        </fieldType>

        <fieldType name="url" class="solr.TextField"

            positionIncrementGap="100">

            <analyzer>

                <tokenizer class="solr.StandardTokenizerFactory"/>

                <filter class="solr.LowerCaseFilterFactory"/>

                <filter class="solr.WordDelimiterFilterFactory"

                    generateWordParts="1" generateNumberParts="1"/>

            </analyzer>

        </fieldType>

    </types>

    <fields>

        <field name="id" type="string" stored="true" indexed="true"/>

<field name="_version_" type="long" indexed="true" stored="true"/>

        <!-- core fields -->

        <field name="batchId" type="string" stored="true" indexed="false"/>

        <field name="digest" type="string" stored="true" indexed="false"/>

        <field name="boost" type="float" stored="true" indexed="false"/>

        <!-- fields for index-basic plugin -->

        <field name="host" type="url" stored="false" indexed="true"/>

        <field name="url" type="url" stored="true" indexed="true"

            required="true"/>

        <field name="content" type="text" stored="false" indexed="true"/>

        <field name="title" type="text" stored="true" indexed="true"/>

        <field name="cache" type="string" stored="true" indexed="false"/>

        <field name="tstamp" type="date" stored="true" indexed="false"/>

        <!-- fields for index-anchor plugin -->

        <field name="anchor" type="string" stored="true" indexed="true"

            multiValued="true"/>

        <!-- fields for index-more plugin -->

        <field name="type" type="string" stored="true" indexed="true"

            multiValued="true"/>

        <field name="contentLength" type="long" stored="true"

            indexed="false"/>

        <field name="lastModified" type="date" stored="true"

            indexed="false"/>

        <field name="date" type="date" stored="true" indexed="true"/>

        <!-- fields for languageidentifier plugin -->

        <field name="lang" type="string" stored="true" indexed="true"/>

        <!-- fields for subcollection plugin -->

        <field name="subcollection" type="string" stored="true"

            indexed="true" multiValued="true"/>

        <!-- fields for feed plugin (tag is also used by microformats-reltag)-->

        <field name="author" type="string" stored="true" indexed="true"/>

        <field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>

        <field name="feed" type="string" stored="true" indexed="true"/>

        <field name="publishedDate" type="date" stored="true"

            indexed="true"/>

        <field name="updatedDate" type="date" stored="true"

            indexed="true"/>

        <!-- fields for creativecommons plugin -->

        <field name="cc" type="string" stored="true" indexed="true"

            multiValued="true"/>

            

        <!-- fields for tld plugin -->    

        <field name="tld" type="string" stored="false" indexed="false"/>

    </fields>

    <uniqueKey>id</uniqueKey>

    <defaultSearchField>content</defaultSearchField>

    <solrQueryParser defaultOperator="OR"/>

</schema>

分析上述文件，主要指定了以下内容：

（1）FiledType：域的类型

（2）Field：哪些域被索引、存储等，以及这个域是什么类型。

（3）uniqueKey：哪个域作为id，即文章的唯一标识。

（4）defaultSearchField：默认的搜索域

（5）solrQueryParser：OR，即使用OR来构建Query。

2、Fileds属性分析

（1）Fileds中有2个Field是必填值

uniqueKey中指定的：id
使用了required指定的：url

（2）Fields中有8个基本Field，经过索引后，它们的状态如下：

由上可见，boost与tstamp是不被索引的，但被存储了，因此不可以被搜索，但可以呈现。

而content则被索引，但不被存储，因此可以搜索，但不可以在结果中呈现。

二、Nutch抓取的基本流程

关于更详细的抓取流程说明，请见 

Nutch2.2.1抓取流程

./bin/crawl urls csdnitpub http://ip:8983/solr/ 5

 crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>

<seedDir>：放置种子文件的目录

<crawlID> ：这个抓取任务的ID

<solrURL>：用于索引及搜索的solr地址

<numberOfRounds>：迭代次数，即抓取深度

一个nutch抓取流程如下：

（1）InjectorJob

开始第一个迭代

（2）GeneratorJob

（3）FetcherJob

（4）ParserJob

（5）DbUpdaterJob

（6）SolrIndexerJob

开始第二个迭代

（2）GeneratorJob
（3）FetcherJob
（4）ParserJob
（5）DbUpdaterJob
（6）SolrIndexerJob

开始第三个迭代

……

一个典型任务的日志如下：

InjectorJob: starting at 2014-07-08 10:41:27

InjectorJob: Injecting urlDir: urls

InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.

InjectorJob: total number of urls rejected by filters: 0

InjectorJob: total number of urls injected after normalization and filtering: 2

Injector: finished at 2014-07-08 10:41:32, elapsed: 00:00:05

Tue Jul 8 10:41:33 CST 2014 : Iteration 1 of 5

Generating batchId

Generating a new fetchlist

GeneratorJob: starting at 2014-07-08 10:41:34

GeneratorJob: Selecting best-scoring urls due for fetch.

GeneratorJob: starting

GeneratorJob: filtering: false

GeneratorJob: normalizing: false

GeneratorJob: topN: 50000

GeneratorJob: finished at 2014-07-08 10:41:39, time elapsed: 00:00:05

GeneratorJob: generated batch id: 1404787293-26339

Fetching :

FetcherJob: starting

FetcherJob: batchId: 1404787293-26339

Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.

FetcherJob: threads: 50

FetcherJob: parsing: false

FetcherJob: resuming: false

FetcherJob : timelimit set for : 1404798101129

Using queue mode : byHost

Fetcher: threads: 50

QueueFeeder finished: total 2 records. Hit by time limit :0

fetching http://www.csdn.net/ (queue crawl delay=5000ms)

Fetcher: throughput threshold: -1

Fetcher: throughput threshold sequence: 5

fetching http://www.itpub.net/ (queue crawl delay=5000ms)

-finishing thread FetcherThread47, activeThreads=48

-finishing thread FetcherThread46, activeThreads=47

-finishing thread FetcherThread45, activeThreads=46

-finishing thread FetcherThread44, activeThreads=45

-finishing thread FetcherThread43, activeThreads=44

-finishing thread FetcherThread42, activeThreads=43

-finishing thread FetcherThread41, activeThreads=42

-finishing thread FetcherThread40, activeThreads=41

-finishing thread FetcherThread39, activeThreads=40

-finishing thread FetcherThread38, activeThreads=39

-finishing thread FetcherThread37, activeThreads=38

-finishing thread FetcherThread36, activeThreads=37

-finishing thread FetcherThread35, activeThreads=36

-finishing thread FetcherThread34, activeThreads=35

-finishing thread FetcherThread33, activeThreads=34

-finishing thread FetcherThread32, activeThreads=33

-finishing thread FetcherThread31, activeThreads=32

-finishing thread FetcherThread30, activeThreads=31

-finishing thread FetcherThread29, activeThreads=30

-finishing thread FetcherThread48, activeThreads=29

-finishing thread FetcherThread27, activeThreads=29

-finishing thread FetcherThread26, activeThreads=28

-finishing thread FetcherThread25, activeThreads=27

-finishing thread FetcherThread24, activeThreads=26

-finishing thread FetcherThread23, activeThreads=25

-finishing thread FetcherThread22, activeThreads=24

-finishing thread FetcherThread21, activeThreads=23

-finishing thread FetcherThread20, activeThreads=22

-finishing thread FetcherThread19, activeThreads=21

-finishing thread FetcherThread18, activeThreads=20

-finishing thread FetcherThread17, activeThreads=19

-finishing thread FetcherThread16, activeThreads=18

-finishing thread FetcherThread15, activeThreads=17

-finishing thread FetcherThread14, activeThreads=16

-finishing thread FetcherThread13, activeThreads=15

-finishing thread FetcherThread12, activeThreads=14

-finishing thread FetcherThread11, activeThreads=13

-finishing thread FetcherThread10, activeThreads=12

-finishing thread FetcherThread9, activeThreads=11

-finishing thread FetcherThread8, activeThreads=10

-finishing thread FetcherThread7, activeThreads=9

-finishing thread FetcherThread5, activeThreads=8

-finishing thread FetcherThread4, activeThreads=7

-finishing thread FetcherThread3, activeThreads=6

-finishing thread FetcherThread2, activeThreads=5

-finishing thread FetcherThread49, activeThreads=4

-finishing thread FetcherThread6, activeThreads=3

-finishing thread FetcherThread28, activeThreads=2

-finishing thread FetcherThread0, activeThreads=1

fetch of http://www.itpub.net/ failed with: java.io.IOException: unzipBestEffort returned null

-finishing thread FetcherThread1, activeThreads=0

0/0 spinwaiting/active, 2 pages, 1 errors, 0.4 0 pages/s, 93 93 kb/s, 0 URLs in 0 queues

-activeThreads=0

FetcherJob: done

Parsing :

ParserJob: starting

ParserJob: resuming:    false

ParserJob: forced reparse:      false

ParserJob: batchId:     1404787293-26339

Parsing http://www.csdn.net/

http://www.csdn.net/ skipped. Content of size 92777 was truncated to 59561

Parsing http://www.itpub.net/

ParserJob: success

CrawlDB update for csdnitpub

DbUpdaterJob: starting

DbUpdaterJob: done

Indexing csdnitpub on SOLR index -> http://ip:8983/solr/

SolrIndexerJob: starting

SolrIndexerJob: done.

SOLR dedup -> http://ip:8983/solr/

Tue Jul 8 10:42:18 CST 2014 : Iteration 2 of 5

Generating batchId

Generating a new fetchlist

GeneratorJob: starting at 2014-07-08 10:42:19

GeneratorJob: Selecting best-scoring urls due for fetch.

GeneratorJob: starting

GeneratorJob: filtering: false

GeneratorJob: normalizing: false

GeneratorJob: topN: 50000

GeneratorJob: finished at 2014-07-08 10:42:25, time elapsed: 00:00:05

GeneratorJob: generated batch id: 1404787338-30453

Fetching :

FetcherJob: starting

FetcherJob: batchId: 1404787338-30453

Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.

FetcherJob: threads: 50

FetcherJob: parsing: false

FetcherJob: resuming: false

FetcherJob : timelimit set for : 1404798146676

Using queue mode : byHost

Fetcher: threads: 50

QueueFeeder finished: total 0 records. Hit by time limit :0

附一些说明：

Crawling the Web is already explained above. You can add more URLs in the seed.txt file and crawl the same.

When a user invokes a crawling command in Apache Nutch 1.x, CrawlDB is generated by Apache Nutch which is nothing but a directory and which contains details about crawling. In Apache 2.x, CrawlDB is not present. Instead, Apache Nutch keeps all the crawling
data directly in the database. In our case, we have used Apache HBase, so all crawling data would go inside Apache HBase. The following are details of how each function of crawling works.

A crawling cycle has four steps, in which each is implemented as a Hadoop MapReduce job:

• GeneratorJob

• FetcherJob

• ParserJob (optionally done while fetching using 'fetch.parse')

• DbUpdaterJob

Additionally, the following processes need to be understood:

• InjectorJob

• Invertlinks

• Indexing with Apache Solr

First of all, the job of an Injector is to populate initial rows for the web table. The InjectorJob will initialize crawldb with the URLs that we have provided. We need to run the InjectorJob by providing certain URLs, which will then be inserted into crawlDB.

Then the GeneratorJob will use these injected URLs and perform the operation. The table which is used for input and output for these jobs is called webpage, in which

every row is a URL (web page). The row key is stored as a URL with reversed host components so that URLs from the same TLD and domain can be kept together and

form a group. In most NoSQL stores, row keys are sorted and give an advantage.

Using specific rowkey filtering, scanning will be faster over a subset, rather than scanning over the entire table. Following are the examples of rowkey listing:

• org.apache..www:http/

• org.apache.gora:http/

Let's define each step in depth so that we can understand crawling step-by-step.

Apache Nutch contains three main directories, crawlDB, linkdb, and a set of segments. crawlDB is the directory which contains information about every URL that is known to Apache Nutch. If it is fetched, crawlDB contains the details when it was fetched. The
linkdatabase or linkdb contains all the links to each URL which will include source URL and also the anchor text of the link. A set of segments is a URL set, which is fetched as a unit. This directory will contain the following subdirectories:

• A crawl_generate job will be used for a set of URLs to be fetched

• A crawl_fetch job will contain the status of fetching each URL

• A content will contain the content of rows retrieved from every URL

Now let's understand each job of crawling in detail.

三、与抓取相关的一些hbase操作

（1）进入bhase shell

[root@jediael44 hbase-0.90.4]# ./bin/hbase shell

HBase Shell; enter 'help<RETURN>' for list of supported commands.

Type "exit<RETURN>" to leave the HBase Shell

Version 0.90.4, r1150278, Sun Jul 24 15:53:29 PDT 2011

（2）查看所有表

hbase(main):001:0> list

TABLE

20140710_webpage

TestCrawl_webpage

csdnitpub_webpage

stackoverflow_webpage

4 row(s) in 0.7620 seconds

（3）查看某个表的结构

hbase(main):002:0> describe '20140710_webpage'

DESCRIPTION ENABLED

{NAME => '20140710_webpage', FAMILIES => [{NAME => 'f', BLOOMFILTER => 'NONE', true

REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', TTL => '21474

83647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAM

E => 'h', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '1', COM

PRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'fa

lse', BLOCKCACHE => 'true'}, {NAME => 'il', BLOOMFILTER => 'NONE', REPLICATION_

SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOC

KSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'mk', B

LOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION =>

'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCK

CACHE => 'true'}, {NAME => 'mtdt', BLOOMFILTER => 'NONE', REPLICATION_SCOPE =>

'0', VERSIONS => '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE =>

'65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'ol', BLOOMFILTE

R => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE',

TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE =>

'true'}, {NAME => 'p', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSION

S => '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_

MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 's', BLOOMFILTER => 'NONE',

REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', TTL => '21474

83647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}

1 row(s) in 0.0880 seconds

（4）查看某列的HTML文件

hbase(main):010:0* get '20140710_webpage', 'ar.com.admival.www:http/', 'f:cnt'

COLUMN CELL

f:cnt timestamp=1404983104070, value=\x0D\x0A<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:

o="urn:schemas-microsoft-com:office:office" xmlns="http://www.w3.org/TR/REC-html40">\x0D\x0

A\x0D\x0A<head>\x0D\x0A<meta http-equiv="Content-Type" content="text/html; charset=windows-

1252">\x0D\x0A<meta http-equiv="Content-Language" content="es">\x0D\x0A\x0D\x0A<title>CAMBI

O DE CHEQUES | cambio de cheque | cheque | cheques | compra de cheques de terceros | cambio

cheques | cheques no a la orden </title>\x0D\x0A<link rel="shortcut icon" ref="favivon.ico

" />\x0D\x0A<meta name="GENERATOR" content="Microsoft FrontPage 5.0">\x0D\x0A<meta name="Pr

ogId" content="FrontPage.Editor.Document">\x0D\x0A<meta name=keywords content="cambio de ch

eques, compro cheques, canje de cheques, Compro cheques de terceros, cambio cheques">\x0D\x

0A<meta name="description" http-equiv="description" content="Admival.com.ar cambia por efec

tivo sus cheques de terceros, posdatados, al d\xECa, de Sociedades, personales, cheques no

a la orden, cheques de indemnizaci\xF3n por despido, etc. " /> \x0D\x0A<meta name="robots"

content="index,follow">\

未完……

其中行的名称可以通过scan '20140710_webpage'得到。

四、ant runtime所构建的内容

1、构建Nutch

tar -zxvf apache-nutch-2.2.1-src.tar.gz

cd apache-nutch-2.2.1

ant runtime

2、 ant构建之后，生成runtime文件夹，该文件夹下面有deploy和local文件夹，分别代表了nutch的两种运行方式：

Deploy：的数据必须运行在Hadoop的HDFS中

local：是运行在本地目录中。

（1）二者的目录结构如下：

[jediael@jediael44 runtime]$ ls deploy/ local/

deploy/:

apache-nutch-2.2.1.job bin

local/:

bin conf lib logs plugins test

在deploy中，文件被打包成一个Job，作为Hadoop的一个Job来运行。

（2）二者目录下均有一个bin的目录，其内包含相同的crawl与nutch两个执行文件。

我们查看nutch文件的最后几行

if $local; then

# fix for the external Xerceslib issue with SAXParserFactory

NUTCH_OPTS="-Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl$NUTCH_OPTS"

EXEC_CALL="$JAVA$JAVA_HEAP_MAX $NUTCH_OPTS -classpath $CLASSPATH"

else

# check that hadoop can befound on the path

if [ $(which hadoop | wc -l )-eq 0 ]; then

echo "Can't findHadoop executable. Add HADOOP_HOME/bin to the path or run in local mode."

exit -1;

# distributed mode

EXEC_CALL="hadoop jar$NUTCH_JOB"

# run it

exec $EXEC_CALL $CLASS "$@"

即默认情况下为 EXEC_CALL="hadoop jar$NUTCH_JOB"，若为Local，则 EXEC_CALL="$JAVA$JAVA_HEAP_MAX $NUTCH_OPTS -classpath $CLASSPATH"，若未local，且hadoop不存在，则报错。

（3）根据参数确定类文件

if [ "$COMMAND" = "crawl" ] ; then

CLASS=org.apache.nutch.crawl.Crawler

elif [ "$COMMAND" = "inject" ] ; then

CLASS=org.apache.nutch.crawl.InjectorJob

elif [ "$COMMAND" = "hostinject" ] ; then

CLASS=org.apache.nutch.host.HostInjectorJob

elif [ "$COMMAND" = "generate" ] ; then

CLASS=org.apache.nutch.crawl.GeneratorJob

elif [ "$COMMAND" = "fetch" ] ; then

CLASS=org.apache.nutch.fetcher.FetcherJob

elif [ "$COMMAND" = "parse" ] ; then

CLASS=org.apache.nutch.parse.ParserJob

elif [ "$COMMAND" = "updatedb" ] ; then

CLASS=org.apache.nutch.crawl.DbUpdaterJob

elif [ "$COMMAND" = "updatehostdb" ] ; then

CLASS=org.apache.nutch.host.HostDbUpdateJob

elif [ "$COMMAND" = "readdb" ] ; then

CLASS=org.apache.nutch.crawl.WebTableReader

elif [ "$COMMAND" = "readhostdb" ] ; then

CLASS=org.apache.nutch.host.HostDbReader

elif [ "$COMMAND" = "elasticindex" ] ; then

CLASS=org.apache.nutch.indexer.elastic.ElasticIndexerJob

elif [ "$COMMAND" = "solrindex" ] ; then

CLASS=org.apache.nutch.indexer.solr.SolrIndexerJob

elif [ "$COMMAND" = "solrdedup" ] ; then

CLASS=org.apache.nutch.indexer.solr.SolrDeleteDuplicates

elif [ "$COMMAND" = "parsechecker" ] ; then

CLASS=org.apache.nutch.parse.ParserChecker

elif [ "$COMMAND" = "indexchecker" ] ; then

CLASS=org.apache.nutch.indexer.IndexingFiltersChecker

elif [ "$COMMAND" = "plugin" ] ; then

CLASS=org.apache.nutch.plugin.PluginRepository

如，对于nutch fetch命令，对应的类文件应该是：org.apache.nutch.fetcher.FetcherJob

[jediael@jediael44 java]$ cat org/apache/nutch/fetcher/FetcherJob.java

可以查看类文件。此方法可以查看一切的shell对应的源文件。

五、修改Solr中browse返回内容

1、从content域中搜索

从solr的example中得到的solrConfig.xml中，qf的定义如下：

[html] view
plain copy

<str name="qf">
text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
title^10.0 description^5.0 keywords^5.0 author^2.0 resourcename^1.0
</str>

由于content不占任何的权重，因此如果某个文档只在content中包含关键字的话，搜索结果并不会返回这个文档。因此，对于nutch提取的索引来说，要增加content的权重，以及url的权重（如果需要的话）：

[html] view
plain copy

<str name="qf">
content^1.0 text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
title^10.0 description^5.0 keywords^5.0 author^2.0 resourcename^1.0
</str>

2、保存网页的content内容

将schema.xml中的

 <field name="content" type="text" stored="false" indexed="true"/>

改为

        <field name="content" type="text" stored="true" indexed="true"/>

3、同时显示网页文件与一般文本

velocity/results_list.vm

##parse("hit_plain.vm")

将注释去掉。

4、调整每个搜索返回项的显示内容

vi richtest_doc.vm

<div>

  Id: #field('id')

</div>

改成：

<div>

  time: #field('tstamp')

</div>

<div>

  score: #field('score')

</div>

这个方法可以修改其它字段，详见http://blog.csdn.net/jediael_lu/article/details/38039267

1、从content域中搜索

从solr的example中得到的solrConfig.xml中，qf的定义如下：

[html] view
plain copy

<str name="qf">
text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
title^10.0 description^5.0 keywords^5.0 author^2.0 resourcename^1.0
</str>

[html] view
plain copy

<str name="qf">
content^1.0 text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
title^10.0 description^5.0 keywords^5.0 author^2.0 resourcename^1.0
</str>

2、保存网页的content内容

将schema.xml中的

 <field name="content" type="text" stored="false" indexed="true"/>

改为

        <field name="content" type="text" stored="true" indexed="true"/>

3、同时显示网页文件与一般文本

velocity/results_list.vm

##parse("hit_plain.vm")

将注释去掉。

4、调整每个搜索返回项的显示内容

vi richtest_doc.vm

<div>

  Id: #field('id')

</div>

改成：

<div>

  time: #field('tstamp')

</div>

<div>

  score: #field('score')

</div>

这个方法可以修改其它字段，详见http://blog.csdn.net/jediael_lu/article/details/38039267

【Nutch2.2.1基础教程之2.2】集成Nutch/Hbase/Solr构建搜索引擎之二：内容分析的更多相关文章

【Nutch2.2.1基础教程之2.1】集成Nutch/Hbase/Solr构建搜索引擎之一：安装及运行【单机环境】
1.下载相关软件,并解压版本号如下: (1)apache-nutch-2.2.1 (2) hbase-0.90.4 (3)solr-4.9.0 并解压至/usr/search 2.Nutch的配置 ...
【Nutch2.2.1基础教程之3】Nutch2.2.1配置文件
nutch-site.xml 在nutch2.2.1中,有两份配置文件:nutch-default.xml与nutch-site.xml. 其中前者是nutch自带的默认属性,一般情况下不要修改. 如 ...
【Nutch2.2.1基础教程之3】Nutch2.2.1配置文件分类： H3_NUTCH 2014-08-18 16:33 1376人阅读评论(0) 收藏
nutch-site.xml 在nutch2.2.1中,有两份配置文件:nutch-default.xml与nutch-site.xml. 其中前者是nutch自带的默认属性,一般情况下不要修改. 如 ...
【Nutch2.2.1基础教程之6】Nutch2.2.1抓取流程
一.抓取流程概述 1.nutch抓取流程当使用crawl命令进行抓取任务时,其基本流程步骤如下: (1)InjectorJob 开始第一个迭代 (2)GeneratorJob (3)FetcherJ ...
【Nutch2.2.1基础教程之1】nutch相关异常
1.在任务一开始运行,注入Url时即出现以下错误. InjectorJob: Injecting urlDir: urls InjectorJob: Using class org.apache.go ...
【Nutch2.2.1基础教程之6】Nutch2.2.1抓取流程分类： H3_NUTCH 2014-08-15 21:39 2530人阅读评论(1) 收藏
一.抓取流程概述 1.nutch抓取流程当使用crawl命令进行抓取任务时,其基本流程步骤如下: (1)InjectorJob 开始第一个迭代 (2)GeneratorJob (3)FetcherJ ...
【Nutch2.2.1基础教程之1】nutch相关异常分类： H3_NUTCH 2014-08-08 21:46 1549人阅读评论(2) 收藏
1.在任务一开始运行,注入Url时即出现以下错误. InjectorJob: Injecting urlDir: urls InjectorJob: Using class org.apache.go ...
【Nutch2.3基础教程】集成Nutch/Hadoop/Hbase/Solr构建搜索引擎：安装及运行【集群环境】
1.下载相关软件,并解压版本号如下: (1)apache-nutch-2.3 (2) hadoop-1.2.1 (3)hbase-0.92.1 (4)solr-4.9.0 并解压至/opt/jedi ...
OpenVAS漏洞扫描基础教程之OpenVAS概述及安装及配置OpenVAS服务
OpenVAS漏洞扫描基础教程之OpenVAS概述及安装及配置OpenVAS服务 1. OpenVAS基础知识 OpenVAS(Open Vulnerability Assessment Sys ...

随机推荐

linux 下配置mysql区分大小写（不区分可能出现找不到表的情况）怎么样使用yum来安装mysql
Linux 默认情况下,数据库是区分大小写的:因此,要将mysql设置成不区分大小写在my.cof 设置 lower_case_table_names=1(1忽略大小写,0区分大小写) 检查方式:在 ...
sqlserver查询编辑器编辑数据
1.我想编辑这几行的sortid,方式可以直接写sql,但是还有一种更简洁的方法,如下: 2.用这种方式可以直接修改,比较方便. 3.总结:要做一件事情,可能有很多种方法.而且很有可能有简单的方法,如 ...
iOS 面试题集合
ASIDownloadCache 设置下载缓存它对Get请求的响应数据进行缓存(被缓存的数据必需是成功的200请求): [ASIHTTPRequest setDefaultCache:[ASID ...
Core Data (一)备
序恩,用Core Data也有一段时间了.大大小小的坑也都坑过了.重来没有认真的记录一次.这次需要好好的理一理Core Data.就当一次绝好的机会记录下来.也为了自己加深认识. 为什么要用Core ...
开心菜鸟学习系列－－－－－javascript(2)
最小全局变量 : 1)每个javascript环境有一个全局对象,当你在任意的函数外面使用this的时候可以访问到,你创建的每一个全部变量都成了这个全局对象的属性,在浏览器中,方便起见, ...
find之exec和args
本来以为以前的差不多够用了.呵呵,看到很多高手用高技巧,心痒痒的觉得我自己还可以提升啊..哈哈哈. 这个实践起来之后,,SED,AWK也得深化一下,,,SHELL和PYTHON,作运维的两样都不能废. ...
BZOJ 1017 魔兽地图DotR（树形DP）
题意:有两类装备,高级装备A和基础装备B.现在有m的钱.每种B有一个单价和可以购买的数量上限.每个Ai可以由Ci种其他物品合成,给出Ci种其他物品每种需要的数量.每个装备有一个贡献值.求最大的贡献值. ...
BZOJ 1502 月下柠檬树(simpson积分)
题目链接:http://61.187.179.132/JudgeOnline/problem.php?id=1502 题意:给出如下一棵分层的树,给出每层的高度和每个面的半径.光线是平行的,与地面夹角 ...
编译不通过：提示XXXX不是类或命名空间名的解决办法
手动写了一个类,需要引入预编译头stdafx.h.结果编译时提示XXXX不是类或命名空间名. 处理方法:将#include "stdafx.h"放在最前面.
图论专题训练1-D(K步最短路,矩阵连乘)
题目链接 /* *题目大意: *求出从i到j,刚好经过k条边的最短路; * *矩阵乘法的应用之一(国家队论文): *矩阵乘法不满足交换律,矩阵乘法满足结合律; *给定一个有向图,问从A点恰好走k步(允 ...

【Nutch2.2.1基础教程之2.2】集成Nutch/Hbase/Solr构建搜索引擎之二：内容分析

Nutch2.2.1抓取流程

【Nutch2.2.1基础教程之2.2】集成Nutch/Hbase/Solr构建搜索引擎之二：内容分析的更多相关文章

随机推荐

热门专题