Annotated explanation of Nutch's nutch-default.xml and regex-urlfilter.txt
nutch-default.xml (annotated)
- <?xml version="1.0"?>
- <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
- <!--
- Licensed to the Apache Software Foundation (ASF) under one or more
- contributor license agreements. See the NOTICE file distributed with
- this work for additional information regarding copyright ownership.
- The ASF licenses this file to You under the Apache License, Version 2.0
- (the "License"); you may not use this file except in compliance with
- the License. You may obtain a copy of the License at
- http://www.apache.org/licenses/LICENSE-2.0
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
- -->
- <!-- Do not modify this file directly. Instead, copy entries that you -->
- <!-- wish to modify from this file into nutch-site.xml and change them -->
- <!-- there. If nutch-site.xml does not already exist, create it. -->
- <configuration>
- <!-- general properties -->
- <property>
- <name>store.ip.address</name>
- <value>false</value>
- <description>Enables us to capture the specific IP address
- (InetSocketAddress) of the host which we connect to via
- the given protocol. Currently supported is protocol-ftp and
- http.
- </description>
- </property>
- <!-- file properties -->
- <property>
- <name>file.content.limit</name>
- <value>65536</value>
- <description>The length limit for downloaded content using the file://
- protocol, in bytes. If this value is nonnegative (>=0), content longer
- than it will be truncated; otherwise, no truncation at all. Do not
- confuse this setting with the http.content.limit setting.
- By default Nutch fetches only the first 65536 bytes of a page and discards the rest.
- For some large sites, however, the home page is far larger than 65536 bytes, and the
- first 65536 bytes may even be pure layout markup without a single hyperlink, so it can
- make sense to set this to a much larger value, or to -1 for no limit at all.
- </description>
- </property>
- <property>
- <name>file.crawl.parent</name>
- <value>true</value>
- <description>The crawler is not restricted to the directories that you specified in the
- Urls file but will jump into the parent directories as well. For your own crawls you can
- change this behavior (set to false) so that only directories beneath the directories that
- you specify get crawled.</description>
- </property>
- <property>
- <name>file.crawl.redirect_noncanonical</name>
- <value>true</value>
- <description>
- If true, protocol-file treats non-canonical file names as
- redirects and does not canonicalize file names internally. A file
- name containing symbolic links as path elements is then not
- resolved and "fetched" but recorded as redirect with the
- canonical name (all links on path are resolved) as redirect
- target.
- </description>
- </property>
- <property>
- <name>file.content.ignored</name>
- <value>true</value>
- <description>If true, no file content will be saved during fetch.
- And it is probably what we want to set most of time, since file:// URLs
- are meant to be local and we can always use them directly at parsing
- and indexing stages. Otherwise file contents will be saved.
- !! NOT IMPLEMENTED YET !!
- </description>
- </property>
- <!-- HTTP properties -->
- <property>
- <name>http.agent.name</name>
- <value></value>
- <description>HTTP 'User-Agent' request header. MUST NOT be empty -
- please set this to a single word uniquely related to your organization.
- NOTE: You should also check other related properties:
- http.robots.agents
- http.agent.description
- http.agent.url
- http.agent.email
- http.agent.version
- and set their values appropriately.
- </description>
- </property>
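- <!-- Annotation: a minimal sketch of overriding this property. "MyNutchSpider" is a
- placeholder agent name, not anything defined by Nutch; copy the block into
- nutch-site.xml and replace the value with a single word identifying your organization.
- <property>
- <name>http.agent.name</name>
- <value>MyNutchSpider</value>
- </property>
- -->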
- <property>
- <name>http.robots.agents</name>
- <value></value>
- <description>Any other agents, apart from 'http.agent.name', that the robots
- parser would look for in robots.txt. Multiple agents can be provided using
- comma as a delimiter. eg. mybot,foo-spider,bar-crawler
- The ordering of agents does NOT matter and the robots parser would make
- decision based on the agent which matches first to the robots rules.
- Also, there is NO need to add a wildcard (ie. "*") to this string as the
- robots parser would smartly take care of a no-match situation.
- If no value is specified, by default HTTP agent (ie. 'http.agent.name')
- would be used for user agent matching by the robots parser.
- </description>
- </property>
- <property>
- <name>http.robot.rules.whitelist</name>
- <value></value>
- <description>Comma separated list of hostnames or IP addresses to ignore
- robot rules parsing for. Use with care and only if you are explicitly
- allowed by the site owner to ignore the site's robots.txt!
- </description>
- </property>
- <property>
- <name>http.robots.403.allow</name>
- <value>true</value>
- <description>Some servers return HTTP status 403 (Forbidden) if
- /robots.txt doesn't exist. This should probably mean that we are
- allowed to crawl the site nonetheless. If this is set to false,
- then such sites will be treated as forbidden.</description>
- </property>
- <property>
- <name>http.agent.description</name>
- <value></value>
- <description>Further description of our bot- this text is used in
- the User-Agent header. It appears in parenthesis after the agent name.
- </description>
- </property>
- <property>
- <name>http.agent.url</name>
- <value></value>
- <description>A URL to advertise in the User-Agent header. This will
- appear in parenthesis after the agent name. Custom dictates that this
- should be a URL of a page explaining the purpose and behavior of this
- crawler.
- </description>
- </property>
- <property>
- <name>http.agent.email</name>
- <value></value>
- <description>An email address to advertise in the HTTP 'From' request
- header and User-Agent header. A good practice is to mangle this
- address (e.g. 'info at example dot com') to avoid spamming.
- </description>
- </property>
- <property>
- <name>http.agent.version</name>
- <value>Nutch-1.10</value>
- <description>A version string to advertise in the User-Agent
- header.</description>
- </property>
- <property>
- <name>http.agent.rotate</name>
- <value>false</value>
- <description>
- If true, instead of http.agent.name, alternating agent names are
- chosen from a list provided via http.agent.rotate.file.
- </description>
- </property>
- <property>
- <name>http.agent.rotate.file</name>
- <value>agents.txt</value>
- <description>
- File containing alternative user agent names to be used instead of
- http.agent.name on a rotating basis if http.agent.rotate is true.
- Each line of the file should contain exactly one agent
- specification including name, version, description, URL, etc.
- </description>
- </property>
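- <!-- Annotation: a hypothetical agents.txt, one agent specification per line, used only
- when http.agent.rotate is true. Both names below are invented examples.
- MyNutchSpider/1.0 (test crawl; http://example.com/bot; bot@example.com)
- MyOtherSpider/1.0 (test crawl; http://example.com/bot; bot@example.com)
- -->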
- <property>
- <name>http.agent.host</name>
- <value></value>
- <description>Name or IP address of the host on which the Nutch crawler
- would be running. Currently this is used by 'protocol-httpclient'
- plugin.
- </description>
- </property>
- <property>
- <name>http.timeout</name>
- <value>10000</value>
- <description>The default network timeout, in milliseconds.</description>
- </property>
- <property>
- <name>http.max.delays</name>
- <value>100</value>
- <description>The number of times a thread will delay when trying to
- fetch a page. Each time it finds that a host is busy, it will wait
- fetcher.server.delay. After http.max.delays attempts, it will give
- up on the page for now.
- This is how often a fetcher thread will wait for a busy host before giving up on a page.
- The length of each wait is determined by fetcher.server.delay, so on a poor network it is
- worth giving fetcher.server.delay a somewhat larger value as well; http.timeout is another
- setting that depends on network conditions.
- </description>
- </property>
- <property>
- <name>http.content.limit</name>
- <value>65536</value>
- <description>The length limit for downloaded content using the http://
- protocol, in bytes. If this value is nonnegative (>=0), content longer
- than it will be truncated; otherwise, no truncation at all. Do not
- confuse this setting with the file.content.limit setting.
- This controls how much of each fetched document is kept. With the default of 65536,
- roughly the first 64 KB of a document is retained and anything beyond that is ignored.
- Raise it for search engines that need complete documents, e.g. XML documents,
- or set it to -1 for no limit (an example override follows this property).
- </description>
- </property>
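- <!-- Annotation: a minimal nutch-site.xml override removing the 64 KB truncation mentioned
- above; -1 disables the limit entirely, so use it with care on very large pages.
- <property>
- <name>http.content.limit</name>
- <value>-1</value>
- </property>
- -->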
- <!-- The following four properties configure the proxy host and port; if the proxy requires a password, also set the username and password (a combined example follows the http.proxy.realm property). -->
- <property>
- <name>http.proxy.host</name>
- <value></value>
- <description>The proxy hostname. If empty, no proxy is used.</description>
- </property>
- <property>
- <name>http.proxy.port</name>
- <value></value>
- <description>The proxy port.</description>
- </property>
- <property>
- <name>http.proxy.username</name>
- <value></value>
- <description>Username for proxy. This will be used by
- 'protocol-httpclient', if the proxy server requests basic, digest
- and/or NTLM authentication. To use this, 'protocol-httpclient' must
- be present in the value of 'plugin.includes' property.
- NOTE: For NTLM authentication, do not prefix the username with the
- domain, i.e. 'susam' is correct whereas 'DOMAIN\susam' is incorrect.
- </description>
- </property>
- <property>
- <name>http.proxy.password</name>
- <value></value>
- <description>Password for proxy. This will be used by
- 'protocol-httpclient', if the proxy server requests basic, digest
- and/or NTLM authentication. To use this, 'protocol-httpclient' must
- be present in the value of 'plugin.includes' property.
- </description>
- </property>
- <property>
- <name>http.proxy.realm</name>
- <value></value>
- <description>Authentication realm for proxy. Do not define a value
- if realm is not required or authentication should take place for any
- realm. NTLM does not use the notion of realms. Specify the domain name
- of NTLM authentication as the value for this property. To use this,
- 'protocol-httpclient' must be present in the value of
- 'plugin.includes' property.
- </description>
- </property>
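- <!-- Annotation: a sketch of the proxy settings combined in nutch-site.xml. The host, port
- and credentials are placeholders; as the descriptions note, username/password are only read
- by the protocol-httpclient plugin, so add it to plugin.includes if authentication is needed.
- <property>
- <name>http.proxy.host</name>
- <value>proxy.example.com</value>
- </property>
- <property>
- <name>http.proxy.port</name>
- <value>8080</value>
- </property>
- <property>
- <name>http.proxy.username</name>
- <value>proxyuser</value>
- </property>
- <property>
- <name>http.proxy.password</name>
- <value>secret</value>
- </property>
- -->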
- <property>
- <name>http.auth.file</name>
- <value>httpclient-auth.xml</value>
- <description>Authentication configuration file for
- 'protocol-httpclient' plugin.
- </description>
- </property>
- <property>
- <name>http.verbose</name>
- <value>false</value>
- <description>If true, HTTP will log more verbosely.</description>
- </property>
- <property>
- <name>http.redirect.max</name>
- <value>0</value>
- <description>The maximum number of redirects the fetcher will follow when
- trying to fetch a page. If set to negative or 0, fetcher won't immediately
- follow redirected URLs, instead it will record them for later fetching.
- </description>
- </property>
- <property>
- <name>http.useHttp11</name>
- <value>false</value>
- <description>NOTE: at the moment this works only for protocol-httpclient.
- If true, use HTTP 1.1, if false use HTTP 1.0 .
- </description>
- </property>
- <property>
- <name>http.accept.language</name>
- <value>en-us,en-gb,en;q=0.7,*;q=0.3</value>
- <description>Value of the "Accept-Language" request header field.
- This allows selecting non-English language as default one to retrieve.
- It is a useful setting for search engines built for a certain national group.
- </description>
- </property>
- <property>
- <name>http.accept</name>
- <value>text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8</value>
- <description>Value of the "Accept" request header field.
- </description>
- </property>
- <property>
- <name>http.store.responsetime</name>
- <value>true</value>
- <description>Enables us to record the response time of the
- host which is the time period between start connection to end
- connection of a pages host. The response time in milliseconds
- is stored in CrawlDb in CrawlDatum's meta data under key "_rs_"
- </description>
- </property>
- <property>
- <name>http.enable.if.modified.since.header</name>
- <value>true</value>
- <description>Whether Nutch sends an HTTP If-Modified-Since header. It reduces
- bandwidth when enabled by not downloading pages that respond with an HTTP
- Not-Modified header. URL's that are not downloaded are not passed through
- parse or indexing filters. If you regularly modify filters, you should force
- Nutch to also download unmodified pages by disabling this feature.
- </description>
- </property>
- <!-- FTP properties -->
- <property>
- <name>ftp.username</name>
- <value>anonymous</value>
- <description>ftp login username.</description>
- </property>
- <property>
- <name>ftp.password</name>
- <value>anonymous@example.com</value>
- <description>ftp login password.</description>
- </property>
- <property>
- <name>ftp.content.limit</name>
- <value>65536</value>
- <description>The length limit for downloaded content, in bytes.
- If this value is nonnegative (>=0), content longer than it will be truncated;
- otherwise, no truncation at all.
- Caution: the classical ftp RFCs never define partial transfer and, in fact,
- some ftp servers out there do not handle client side forced close-down very
- well. Our implementation tries its best to handle such situations smoothly.
- By default Nutch fetches only the first 65536 bytes of a document and discards the rest.
- For some large sites the content is far larger than 65536 bytes, and the first 65536
- bytes may be nothing but layout information without any hyperlinks.
- Set to -1 for no limit.
- </description>
- </property>
- <property>
- <name>ftp.timeout</name>
- <value>60000</value>
- <description>Default timeout for ftp client socket, in millisec.
- Please also see ftp.keep.connection below.</description>
- </property>
- <property>
- <name>ftp.server.timeout</name>
- <value>100000</value>
- <description>An estimation of ftp server idle time, in millisec.
- Typically it is 120000 millisec for many ftp servers out there.
- Better be conservative here. Together with ftp.timeout, it is used to
- decide if we need to delete (annihilate) current ftp.client instance and
- force to start another ftp.client instance anew. This is necessary because
- a fetcher thread may not be able to obtain next request from queue in time
- (due to idleness) before our ftp client times out or remote server
- disconnects. Used only when ftp.keep.connection is true (please see below).
- </description>
- </property>
- <property>
- <name>ftp.keep.connection</name>
- <value>false</value>
- <description>Whether to keep ftp connection. Useful if crawling same host
- again and again. When set to true, it avoids connection, login and dir list
- parser setup for subsequent urls. If it is set to true, however, you must
- make sure (roughly):
- (1) ftp.timeout is less than ftp.server.timeout
- (2) ftp.timeout is larger than (fetcher.threads.fetch * fetcher.server.delay)
- Otherwise there will be too many "delete client because idled too long"
- messages in thread logs.</description>
- </property>
- <property>
- <name>ftp.follow.talk</name>
- <value>false</value>
- <description>Whether to log dialogue between our client and remote
- server. Useful for debugging.</description>
- </property>
- <!-- web db properties -->
- <property>
- <name>db.fetch.interval.default</name>
- <value>2592000</value>
- <description>The default number of seconds between re-fetches of a page (30 days).
- This is useful when building periodic, automatic re-crawls: it sets how long before a page is re-fetched. The default is 2592000 seconds, i.e. 30 days (an illustrative override follows this property).
- </description>
- </property>
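- <!-- Annotation: an illustrative nutch-site.xml override for a weekly re-crawl; 604800
- seconds = 7 days. Keep the value below db.fetch.interval.max, which caps how long a page
- may go without being re-fetched.
- <property>
- <name>db.fetch.interval.default</name>
- <value>604800</value>
- </property>
- -->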
- <property>
- <name>db.fetch.interval.max</name>
- <value>7776000</value>
- <description>The maximum number of seconds between re-fetches of a page
- (90 days). After this period every page in the db will be re-tried, no
- matter what is its status.
- </description>
- </property>
- <property>
- <name>db.fetch.schedule.class</name>
- <value>org.apache.nutch.crawl.DefaultFetchSchedule</value>
- <description>The implementation of fetch schedule. DefaultFetchSchedule simply
- adds the original fetchInterval to the last fetch time, regardless of
- page changes.</description>
- </property>
- <property>
- <name>db.fetch.schedule.adaptive.inc_rate</name>
- <value>0.4</value>
- <description>If a page is unmodified, its fetchInterval will be
- increased by this rate. This value should not
- exceed 0.5, otherwise the algorithm becomes unstable.</description>
- </property>
- <property>
- <name>db.fetch.schedule.adaptive.dec_rate</name>
- <value>0.2</value>
- <description>If a page is modified, its fetchInterval will be
- decreased by this rate. This value should not
- exceed 0.5, otherwise the algorithm becomes unstable.</description>
- </property>
- <property>
- <name>db.fetch.schedule.adaptive.min_interval</name>
- <value>60.0</value>
- <description>Minimum fetchInterval, in seconds.</description>
- </property>
- <property>
- <name>db.fetch.schedule.adaptive.max_interval</name>
- <value>31536000.0</value>
- <description>Maximum fetchInterval, in seconds (365 days).
- NOTE: this is limited by db.fetch.interval.max. Pages with
- fetchInterval larger than db.fetch.interval.max
- will be fetched anyway.</description>
- </property>
- <property>
- <name>db.fetch.schedule.adaptive.sync_delta</name>
- <value>true</value>
- <description>If true, try to synchronize with the time of page change
- by shifting the next fetchTime by a fraction (sync_rate) of the difference
- between the last modification time and the last fetch time.</description>
- </property>
- <property>
- <name>db.fetch.schedule.adaptive.sync_delta_rate</name>
- <value>0.3</value>
- <description>See sync_delta for description. This value should not
- exceed 0.5, otherwise the algorithm becomes unstable.</description>
- </property>
- <property>
- <name>db.fetch.schedule.mime.file</name>
- <value>adaptive-mimetypes.txt</value>
- <description>The configuration file for the MimeAdaptiveFetchSchedule.
- </description>
- </property>
- <property>
- <name>db.update.additions.allowed</name>
- <value>true</value>
- <description>If true, updatedb will add newly discovered URLs, if false
- only already existing URLs in the CrawlDb will be updated and no new
- URLs will be added.
- </description>
- </property>
- <property>
- <name>db.preserve.backup</name>
- <value>true</value>
- <description>If true, updatedb will keep a backup of the previous CrawlDB
- version in the old directory. In case of disaster, one can rename old to
- current and restore the CrawlDB to its previous state.
- </description>
- </property>
- <property>
- <name>db.update.purge.404</name>
- <value>false</value>
- <description>If true, updatedb will purge records with status DB_GONE
- from the CrawlDB.
- </description>
- </property>
- <property>
- <name>db.url.normalizers</name>
- <value>false</value>
- <description>Normalize urls when updating crawldb</description>
- </property>
- <property>
- <name>db.url.filters</name>
- <value>false</value>
- <description>Filter urls when updating crawldb</description>
- </property>
- <property>
- <name>db.update.max.inlinks</name>
- <value>10000</value>
- <description>Maximum number of inlinks to take into account when updating
- a URL score in the crawlDB. Only the best scoring inlinks are kept.
- </description>
- </property>
- <property>
- <name>db.ignore.internal.links</name>
- <value>true</value>
- <description>If true, when adding new links to a page, links from
- the same host are ignored. This is an effective way to limit the
- size of the link database, keeping only the highest quality
- links.
- </description>
- </property>
- <property>
- <name>db.ignore.external.links</name>
- <value>false</value>
- <description>If true, outlinks leading from a page to external hosts
- will be ignored. This is an effective way to limit the crawl to include
- only initially injected hosts, without creating complex URLFilters.
- If true, only pages on the same hosts as the seeds are crawled and external links are ignored.
- The same effect can be achieved by adding filters to regex-urlfilter.txt (see the commented
- example below), but a very large number of filters, e.g. several thousand, will noticeably hurt Nutch's performance.
- </description>
- </property>
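- <!-- Annotation: the regex-urlfilter.txt alternative mentioned above, assuming a single
- seed host named example.com (a placeholder). Rules are applied in order: only URLs on that
- host are accepted, everything else is rejected by the final catch-all rule.
- +^https?://([a-z0-9]+\.)*example\.com/
- -.
- -->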
- <property>
- <name>db.injector.overwrite</name>
- <value>false</value>
- <description>Whether existing records in the CrawlDB will be overwritten
- by injected records.
- </description>
- </property>
- <property>
- <name>db.injector.update</name>
- <value>false</value>
- <description>If true existing records in the CrawlDB will be updated with
- injected records. Old meta data is preserved. The db.injector.overwrite
- parameter has precedence.
- </description>
- </property>
- <property>
- <name>db.score.injected</name>
- <value>1.0</value>
- <description>The score of new pages added by the injector.
- The default page score (a measure of importance) assigned to URLs at inject time.
- </description>
- </property>
- <property>
- <name>db.score.link.external</name>
- <value>1.0</value>
- <description>The score factor for new pages added due to a link from
- another host relative to the referencing page's score. Scoring plugins
- may use this value to affect initial scores of external links.
- </description>
- </property>
- <property>
- <name>db.score.link.internal</name>
- <value>1.0</value>
- <description>The score factor for pages added due to a link from the
- same host, relative to the referencing page's score. Scoring plugins
- may use this value to affect initial scores of internal links.
- </description>
- </property>
- <property>
- <name>db.score.count.filtered</name>
- <value>false</value>
- <description>The score value passed to newly discovered pages is
- calculated as a fraction of the original page score divided by the
- number of outlinks. If this option is false, only the outlinks that passed
- URLFilters will count, if it's true then all outlinks will count.
- </description>
- </property>
- <property>
- <name>db.max.inlinks</name>
- <value>10000</value>
- <description>Maximum number of Inlinks per URL to be kept in LinkDb.
- If "invertlinks" finds more inlinks than this number, only the first
- N inlinks will be stored, and the rest will be discarded.
- </description>
- </property>
- <property>
- <name>db.max.outlinks.per.page</name>
- <value>100</value>
- <description>The maximum number of outlinks that we'll process for a page.
- If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
- will be processed for a page; otherwise, all outlinks will be processed.
- By default Nutch processes only 100 outlinks per page, so some links never get crawled.
- To change this, raise the value or set it to -1, which means no limit.
- </description>
- </property>
- <property>
- <name>db.max.anchor.length</name>
- <value>100</value>
- <description>The maximum number of characters permitted in an anchor.
- </description>
- </property>
- <property>
- <name>db.parsemeta.to.crawldb</name>
- <value></value>
- <description>Comma-separated list of parse metadata keys to transfer to the crawldb (NUTCH-779).
- Assuming for instance that the languageidentifier plugin is enabled, setting the value to 'lang'
- will copy both the key 'lang' and its value to the corresponding entry in the crawldb.
- </description>
- </property>
- <property>
- <name>db.fetch.retry.max</name>
- <value>3</value>
- <description>The maximum number of times a url that has encountered
- recoverable errors is generated for fetch.</description>
- </property>
- <property>
- <name>db.signature.class</name>
- <value>org.apache.nutch.crawl.MD5Signature</value>
- <description>The default implementation of a page signature. Signatures
- created with this implementation will be used for duplicate detection
- and removal.</description>
- </property>
- <property>
- <name>db.signature.text_profile.min_token_len</name>
- <value>2</value>
- <description>Minimum token length to be included in the signature.
- </description>
- </property>
- <property>
- <name>db.signature.text_profile.quant_rate</name>
- <value>0.01</value>
- <description>Profile frequencies will be rounded down to a multiple of
- QUANT = (int)(QUANT_RATE * maxFreq), where maxFreq is a maximum token
- frequency. If maxFreq > 1 then QUANT will be at least 2, which means that
- for longer texts tokens with frequency 1 will always be discarded.
- </description>
- </property>
- <!-- generate properties -->
- <property>
- <name>generate.max.count</name>
- <value>-1</value>
- <description>The maximum number of urls in a single
- fetchlist. -1 if unlimited. The urls are counted according
- to the value of the parameter generator.count.mode.
- Together with generate.count.mode, this limits each generated fetchlist to at most
- (generate.max.count - 1) URLs belonging to the same host/domain/IP.
- -1 means no limit on how many URLs from the same host/domain/IP a fetchlist may contain.
- </description>
- </property>
- <property>
- <name>generate.count.mode</name>
- <value>host</value>
- <description>Determines how the URLs are counted for generator.max.count.
- Default value is 'host' but can be 'domain'. Note that we do not count
- per IP in the new version of the Generator.
- One of byHost/byDomain/byIP, i.e. the way URLs are counted against the limit given by
- generate.max.count.
- With byHost, the URLs in each fetchlist are counted per host: a segment may not contain
- more than generate.max.count URLs belonging to the same host; any URLs beyond that are
- placed in another fetchlist (provided a not-yet-full fetchlist is still available).
- </description>
- </property>
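- <!-- Annotation: a sketch combining the two generate properties so that each generated
- fetchlist is limited to roughly 100 URLs per domain; the values are illustrative only.
- <property>
- <name>generate.max.count</name>
- <value>100</value>
- </property>
- <property>
- <name>generate.count.mode</name>
- <value>domain</value>
- </property>
- -->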
- <property>
- <name>generate.update.crawldb</name>
- <value>false</value>
- <description>For highly-concurrent environments, where several
- generate/fetch/update cycles may overlap, setting this to true ensures
- that generate will create different fetchlists even without intervening
- updatedb-s, at the cost of running an additional job to update CrawlDB.
- If false, running generate twice without intervening
- updatedb will generate identical fetchlists.
- Controls whether the CrawlDB is updated after the generator finishes, mainly by setting the
- _ngt_ field of each CrawlDatum to the time of this generator run. This prevents the next
- generator run (one started within the window given by crawl.gen.delay) from adding the same
- URLs. (Note: even if a later run does add the same URLs, nothing breaks logically; it only
- wastes resources by fetching the same URLs again.)
- </description>
- </property>
- <property>
- <name>generate.min.score</name>
- <value>0</value>
- <description>Select only entries with a score larger than
- generate.min.score.
- If, after the ScoreFilters have run, a URL's score (a measure of page importance, similar
- to a PageRank value) is still below generate.min.score, the URL is skipped and not added to
- the fetchlist. Setting this value makes the generator put only the more important pages
- into the fetchlist. With 0, no URL is filtered out at the generate stage because of its score.
- </description>
- </property>
- <property>
- <name>generate.min.interval</name>
- <value>-1</value>
- <description>Select only entries with a retry interval lower than
- generate.min.interval. A value of -1 disables this check.
- When set, the generator only considers URLs that need to be fetched frequently (i.e. whose
- CrawlDatum fetchInterval is small); URLs that do not need frequent fetching are left out of
- the fetchlist. -1 disables this check.
- </description>
- </property>
- <!-- urlpartitioner properties -->
- <property>
- <name>partition.url.mode</name>
- <value>byHost</value>
- <description>Determines how to partition URLs. Default value is 'byHost',
- also takes 'byDomain' or 'byIP'.
- This setting determines how URLs are partitioned after the map phase; by default they are
- hashed by host, so URLs with the same host end up on the same reduce node.
- When the generated fetchlist is partitioned, the partitioning key is one of: byHost/byDomain/byIP.
- </description>
- </property>
- <property>
- <name>crawl.gen.delay</name>
- <value>604800000</value>
- <description>
- This value, expressed in milliseconds, defines how long we should keep the lock on records
- in CrawlDb that were just selected for fetching. If these records are not updated
- in the meantime, the lock is canceled, i.e. they become eligible for selecting.
- Default value of this is 7 days (604800000 ms).
- When the generator runs, it stores the time of the last generate for each URL under the key
- "_ngt_" (standing for "nutch generate time"), which marks that the URL has already been put
- into some fetchlist and may still be in the middle of a fetch->updatedb cycle. That cycle
- can take a long time, or it may have failed along the way. When a later generator run
- considers whether the URL can go into the new fetchlist, it needs a way to decide between
- adding it and waiting for the earlier fetch->updatedb cycle to finish (which would update
- the URL's _ngt_ in the CrawlDB to the time of the last successful generate). crawl.gen.delay
- solves this: if "_ngt_" + crawl.gen.delay is earlier than the current time, the URL may be
- added to the newly generated fetchlist; otherwise it is skipped.
- </description>
- </property>
- <!-- fetcher properties -->
- <property>
- <name>fetcher.server.delay</name>
- <value>5.0</value>
- <description>The number of seconds the fetcher will delay between
- successive requests to the same server. Note that this might get
- overridden by a Crawl-Delay from a robots.txt and is used ONLY if
- fetcher.threads.per.queue is set to 1.
- </description>
- </property>
- <property>
- <name>fetcher.server.min.delay</name>
- <value>0.0</value>
- <description>The minimum number of seconds the fetcher will delay between
- successive requests to the same server. This value is applicable ONLY
- if fetcher.threads.per.queue is greater than 1 (i.e. the host blocking
- is turned off).</description>
- </property>
- <property>
- <name>fetcher.max.crawl.delay</name>
- <value>30</value>
- <description>
- If the Crawl-Delay in robots.txt is set to greater than this value (in
- seconds) then the fetcher will skip this page, generating an error report.
- If set to -1 the fetcher will never skip such pages and will wait the
- amount of time retrieved from robots.txt Crawl-Delay, however long that
- might be.
- </description>
- </property>
- <property>
- <name>fetcher.threads.fetch</name>
- <value>10</value>
- <description>The number of FetcherThreads the fetcher should use.
- This also determines the maximum number of requests that are
- made at once (each FetcherThread handles one connection). The total
- number of threads running in distributed mode will be the number of
- fetcher threads * number of nodes as fetcher has one map task per node.
- This is the maximum number of fetch threads.
- </description>
- </property>
- <property>
- <name>fetcher.threads.per.queue</name>
- <value>1</value>
- <description>This number is the maximum number of threads that
- should be allowed to access a queue at one time. Setting it to
- a value > 1 will cause the Crawl-Delay value from robots.txt to
- be ignored and the value of fetcher.server.min.delay to be used
- as a delay between successive requests to the same server instead
- of fetcher.server.delay.
- </description>
- </property>
- <property>
- <name>fetcher.queue.mode</name>
- <value>byHost</value>
- <description>Determines how to put URLs into queues. Default value is 'byHost',
- also takes 'byDomain' or 'byIP'.
- </description>
- </property>
- <property>
- <name>fetcher.verbose</name>
- <value>false</value>
- <description>If true, fetcher will log more verbosely.
- If true, more detailed information is logged.
- </description>
- </property>
- <property>
- <name>fetcher.parse</name>
- <value>false</value>
- <description>If true, fetcher will parse content. Default is false, which means
- that a separate parsing step is required after fetching is finished.
- Can content be parsed while it is being fetched? It can, but doing so is not recommended.
- </description>
- </property>
- <property>
- <name>fetcher.store.content</name>
- <value>true</value>
- <description>If true, fetcher will store content.</description>
- </property>
- <property>
- <name>fetcher.timelimit.mins</name>
- <value>-1</value>
- <description>This is the number of minutes allocated to the fetching.
- Once this value is reached, any remaining entry from the input URL list is skipped
- and all active queues are emptied. The default value of -1 deactivates the time limit.
- </description>
- </property>
- <property>
- <name>fetcher.max.exceptions.per.queue</name>
- <value>-1</value>
- <description>The maximum number of protocol-level exceptions (e.g. timeouts) per
- host (or IP) queue. Once this value is reached, any remaining entries from this
- queue are purged, effectively stopping the fetching from this host/IP. The default
- value of -1 deactivates this limit.
- </description>
- </property>
- <property>
- <name>fetcher.throughput.threshold.pages</name>
- <value>-1</value>
- <description>The threshold of minimum pages per second. If the fetcher downloads less
- pages per second than the configured threshold, the fetcher stops, preventing slow queues
- from stalling the throughput. This threshold must be an integer. This can be useful when
- fetcher.timelimit.mins is hard to determine. The default value of -1 disables this check.
- </description>
- </property>
- <property>
- <name>fetcher.throughput.threshold.retries</name>
- <value>5</value>
- <description>The number of times the fetcher.throughput.threshold is allowed to be exceeded.
- This setting prevents accidental slowdowns from immediately killing the fetcher thread.
- </description>
- </property>
- <property>
- <name>fetcher.throughput.threshold.check.after</name>
- <value>5</value>
- <description>The number of minutes after which the throughput check is enabled.</description>
- </property>
- <property>
- <name>fetcher.threads.timeout.divisor</name>
- <value>2</value>
- <description>(EXPERT)The thread time-out divisor to use. By default threads have a time-out
- value of mapred.task.timeout / 2. Increase this setting if the fetcher waits too
- long before killing hanged threads. Be careful, a too high setting (+8) will most likely kill the
- fetcher threads prematurely.
- </description>
- </property>
- <property>
- <name>fetcher.queue.depth.multiplier</name>
- <value>50</value>
- <description>(EXPERT)The fetcher buffers the incoming URLs into queues based on the [host|domain|IP]
- (see param fetcher.queue.mode). The depth of the queue is the number of threads times the value of this parameter.
- A large value requires more memory but can improve the performance of the fetch when the order of the URLS in the fetch list
- is not optimal.
- </description>
- </property>
- <property>
- <name>fetcher.follow.outlinks.depth</name>
- <value>-1</value>
- <description>(EXPERT)When fetcher.parse is true and this value is greater than 0 the fetcher will extract outlinks
- and follow until the desired depth is reached. A value of 1 means all generated pages are fetched and their first degree
- outlinks are fetched and parsed too. Be careful, this feature is in itself agnostic of the state of the CrawlDB and does not
- know about already fetched pages. A setting larger than 2 will most likely fetch home pages twice in the same fetch cycle.
- It is highly recommended to set db.ignore.external.links to true to restrict the outlink follower to URL's within the same
- domain. When disabled (false) the feature is likely to follow duplicates even when depth=1.
- A value of -1 or 0 disables this feature.
- </description>
- </property>
- <property>
- <name>fetcher.follow.outlinks.num.links</name>
- <value>4</value>
- <description>(EXPERT)The number of outlinks to follow when fetcher.follow.outlinks.depth is enabled. Be careful, this can multiply
- the total number of pages to fetch. This works with fetcher.follow.outlinks.depth.divisor; with the default settings the number of followed
- outlinks at depth 1 is 8, not 4.
- </description>
- </property>
- <property>
- <name>fetcher.follow.outlinks.depth.divisor</name>
- <value>2</value>
- <description>(EXPERT)The divisor of fetcher.follow.outlinks.num.links per fetcher.follow.outlinks.depth. This decreases the number
- of outlinks to follow by increasing depth. The formula used is: outlinks = floor(divisor / depth * num.links). This prevents
- exponential growth of the fetch list.
- </description>
- </property>
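- <!-- Annotation: a worked example of the formula above with the default divisor=2 and
- num.links=4: depth 1 gives floor(2/1*4)=8 outlinks, depth 2 gives floor(2/2*4)=4, and
- depth 3 gives floor(2/3*4)=2, so the number of followed outlinks shrinks as depth grows. -->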
- <property>
- <name>fetcher.follow.outlinks.ignore.external</name>
- <value>true</value>
- <description>Whether to ignore or follow external links. Set db.ignore.external.links to false and this to true to store outlinks
- in the output but not follow them. If db.ignore.external.links is true this directive is ignored.
- </description>
- </property>
- <property>
- <name>fetcher.bandwidth.target</name>
- <value>-1</value>
- <description>Target bandwidth in kilobits per sec for each mapper instance. This is used to adjust the number of
- fetching threads automatically (up to fetcher.maxNum.threads). A value of -1 deactivates the functionality, in which case
- the number of fetching threads is fixed (see fetcher.threads.fetch).</description>
- </property>
- <property>
- <name>fetcher.maxNum.threads</name>
- <value>25</value>
- <description>Max number of fetch threads allowed when using fetcher.bandwidth.target. Defaults to fetcher.threads.fetch if unspecified or
- set to a value lower than it. </description>
- </property>
- <property>
- <name>fetcher.bandwidth.target.check.everyNSecs</name>
- <value>30</value>
- <description>(EXPERT) Value in seconds which determines how frequently we should reassess the optimal number of fetch threads when using
- fetcher.bandwidth.target. Defaults to 30 and must be at least 1.</description>
- </property>
- <!-- moreindexingfilter plugin properties -->
- <property>
- <name>moreIndexingFilter.indexMimeTypeParts</name>
- <value>true</value>
- <description>Determines whether the index-more plugin will split the mime-type
- in sub parts, this requires the type field to be multi valued. Set to true for backward
- compatibility. False will not split the mime-type.
- </description>
- </property>
- <property>
- <name>moreIndexingFilter.mapMimeTypes</name>
- <value>false</value>
- <description>Determines whether MIME-type mapping is enabled. It takes a
- plain text file with mapped MIME-types. With it the user can map both
- application/xhtml+xml and text/html to the same target MIME-type so it
- can be treated equally in an index. See conf/contenttype-mapping.txt.
- </description>
- </property>
- <!-- AnchorIndexing filter plugin properties -->
- <property>
- <name>anchorIndexingFilter.deduplicate</name>
- <value>false</value>
- <description>With this enabled the indexer will case-insensitively deduplicate anchors
- before indexing. This prevents possible hundreds or thousands of identical anchors for
- a given page to be indexed but will affect the search scoring (i.e. tf=1.0f).
- </description>
- </property>
- <!-- indexingfilter plugin properties -->
- <property>
- <name>indexingfilter.order</name>
- <value></value>
- <description>The order by which index filters are applied.
- If empty, all available index filters (as dictated by properties
- plugin-includes and plugin-excludes above) are loaded and applied in system
- defined order. If not empty, only named filters are loaded and applied
- in given order. For example, if this property has value:
- org.apache.nutch.indexer.basic.BasicIndexingFilter org.apache.nutch.indexer.more.MoreIndexingFilter
- then BasicIndexingFilter is applied first, and MoreIndexingFilter second.
- Filter ordering might have impact on result if one filter depends on output of
- another filter.
- </description>
- </property>
- <property>
- <name>indexer.score.power</name>
- <value>0.5</value>
- <description>Determines the power of link analysis scores. Each
- pages's boost is set to <i>score<sup>scorePower</sup></i> where
- <i>score</i> is its link analysis score and <i>scorePower</i> is the
- value of this parameter. This is compiled into indexes, so, when
- this is changed, pages must be re-indexed for it to take
- effect.</description>
- </property>
- <property>
- <name>indexer.max.title.length</name>
- <value>100</value>
- <description>The maximum number of characters of a title that are indexed. A value of -1 disables this check.
- </description>
- </property>
- <property>
- <name>indexer.max.content.length</name>
- <value>-1</value>
- <description>The maximum number of characters of a content that are indexed.
- Content beyond the limit is truncated. A value of -1 disables this check.
- </description>
- </property>
- <property>
- <name>indexer.add.domain</name>
- <value>false</value>
- <description>Whether to add the domain field to a NutchDocument.</description>
- </property>
- <property>
- <name>indexer.skip.notmodified</name>
- <value>false</value>
- <description>Whether the indexer will skip records with a db_notmodified status.
- </description>
- </property>
- <!-- URL normalizer properties -->
- <property>
- <name>urlnormalizer.order</name>
- <value>org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value>
- <description>Order in which normalizers will run. If any of these isn't
- activated it will be silently skipped. If other normalizers not on the
- list are activated, they will run in random order after the ones
- specified here are run.
- </description>
- </property>
- <property>
- <name>urlnormalizer.regex.file</name>
- <value>regex-normalize.xml</value>
- <description>Name of the config file used by the RegexUrlNormalizer class.
- </description>
- </property>
- <property>
- <name>urlnormalizer.loop.count</name>
- <value>1</value>
- <description>Optionally loop through normalizers several times, to make
- sure that all transformations have been performed.
- </description>
- </property>
- <!-- mime properties -->
- <!--
- <property>
- <name>mime.types.file</name>
- <value>tika-mimetypes.xml</value>
- <description>Name of file in CLASSPATH containing filename extension and
- magic sequence to mime types mapping information. Overrides the default Tika config
- if specified.
- </description>
- </property>
- -->
- <property>
- <name>mime.type.magic</name>
- <value>true</value>
- <description>Defines if the mime content type detector uses magic resolution.
- </description>
- </property>
- <!-- plugin properties -->
- <property>
- <name>plugin.folders</name>
- <value>plugins</value>
- <description>Directories where nutch plugins are located. Each
- element may be a relative or absolute path. If absolute, it is used
- as is. If relative, it is searched for on the classpath.
- This property specifies the plugin directories. When running inside Eclipse it should be
- changed to ./src/plugin, but when running the packaged job on a distributed cluster it should stay plugins.
- </description>
- </property>
- <property>
- <name>plugin.auto-activation</name>
- <value>true</value>
- <description>Defines if some plugins that are not activated regarding
- the plugin.includes and plugin.excludes properties must be automatically
- activated if they are needed by some activated plugins.
- </description>
- </property>
- <property>
- <name>plugin.includes</name>
- <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
- <description>Regular expression naming plugin directory names to
- include. Any plugin not matching this expression is excluded.
- In any case you need at least include the nutch-extensionpoints plugin. By
- default Nutch includes crawling just HTML and plain text via HTTP,
- and basic indexing and search plugins. In order to use HTTPS please enable
- protocol-httpclient, but be aware of possible intermittent problems with the
- underlying commons-httpclient library.
- The configuration item that controls plugin functionality: plugin.includes lists the plugins that need to be loaded (see the commented example below for enabling HTTPS).
- </description>
- </property>
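- <!-- Annotation: a sketch of enabling HTTPS by swapping protocol-http for
- protocol-httpclient, as the description suggests; the rest of the list is kept from the
- default value above.
- <property>
- <name>plugin.includes</name>
- <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
- </property>
- -->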
- <property>
- <name>plugin.excludes</name>
- <value></value>
- <description>Regular expression naming plugin directory names to exclude.
- </description>
- </property>
- <property>
- <name>urlmeta.tags</name>
- <value></value>
- <description>
- To be used in conjunction with features introduced in NUTCH-655, which allows
- for custom metatags to be injected alongside your crawl URLs. Specifying those
- custom tags here will allow for their propagation into a page's outlinks, as
- well as allow for them to be included as part of an index.
- Values should be comma-delimited. ("tag1,tag2,tag3") Do not pad the tags with
- white-space at their boundaries, if you are using anything earlier than Hadoop-0.21.
- </description>
- </property>
- <!-- parser properties -->
- <property>
- <name>parse.plugin.file</name>
- <value>parse-plugins.xml</value>
- <description>The name of the file that defines the associations between
- content-types and parsers.</description>
- </property>
- <property>
- <name>parser.character.encoding.default</name>
- <value>windows-1252</value>
- <description>The character encoding to fall back to when no other information
- is available
- The default encoding, windows-1252, used when parsing documents; it is applied whenever no encoding can be detected from the document itself.
- </description>
- </property>
- <property>
- <name>encodingdetector.charset.min.confidence</name>
- <value>-1</value>
- <description>An integer between 0 and 100 indicating the minimum confidence value
- for charset auto-detection. Any negative value disables auto-detection.
- </description>
- </property>
- <property>
- <name>parser.caching.forbidden.policy</name>
- <value>content</value>
- <description>If a site (or a page) requests through its robot metatags
- that it should not be shown as cached content, apply this policy. Currently
- three keywords are recognized: "none" ignores any "noarchive" directives.
- "content" doesn't show the content, but shows summaries (snippets).
- "all" doesn't show either content or summaries.</description>
- </property>
- <property>
- <name>parser.html.impl</name>
- <value>neko</value>
- <description>HTML Parser implementation. Currently the following keywords
- are recognized: "neko" uses NekoHTML, "tagsoup" uses TagSoup.
- Specifies which parser is used for HTML documents. NekoHTML is the more capable of the two;
- a separate article will later cover Neko's HTML-to-text conversion and HTML fragment parsing in more detail.
- </description>
- </property>
- <property>
- <name>parser.html.form.use_action</name>
- <value>false</value>
- <description>If true, HTML parser will collect URLs from form action
- attributes. This may lead to undesirable behavior (submitting empty
- forms during next fetch cycle). If false, form action attribute will
- be ignored.</description>
- </property>
- <property>
- <name>parser.html.outlinks.ignore_tags</name>
- <value></value>
- <description>Comma separated list of HTML tags, from which outlinks
- shouldn't be extracted. Nutch takes links from: a, area, form, frame,
- iframe, script, link, img. If you add any of those tags here, it
- won't be taken. Default is empty list. Probably reasonable value
- for most people would be "img,script,link".</description>
- </property>
- <property>
- <name>htmlparsefilter.order</name>
- <value></value>
- <description>The order by which HTMLParse filters are applied.
- If empty, all available HTMLParse filters (as dictated by properties
- plugin-includes and plugin-excludes above) are loaded and applied in system
- defined order. If not empty, only named filters are loaded and applied
- in given order.
- HTMLParse filter ordering MAY have an impact
- on end result, as some filters could rely on the metadata generated by a previous filter.
- </description>
- </property>
- <property>
- <name>parser.timeout</name>
- <value>30</value>
- <description>Timeout in seconds for the parsing of a document, otherwise treats it as an exception and
- moves on to the following documents. This parameter is applied to any Parser implementation.
- Set to -1 to deactivate, bearing in mind that this could cause
- the parsing to crash because of a very long or corrupted document.
- </description>
- </property>
- <property>
- <name>parse.filter.urls</name>
- <value>true</value>
- <description>Whether the parser will filter URLs (with the configured URL filters).</description>
- </property>
- <property>
- <name>parse.normalize.urls</name>
- <value>true</value>
- <description>Whether the parser will normalize URLs (with the configured URL normalizers).</description>
- </property>
- <property>
- <name>parser.skip.truncated</name>
- <value>true</value>
- <description>Boolean value for whether we should skip parsing for truncated documents. By default this
- property is activated due to extremely high levels of CPU which parsing can sometimes take.
- </description>
- </property>
- <!--
- <property>
- <name>tika.htmlmapper.classname</name>
- <value>org.apache.tika.parser.html.IdentityHtmlMapper</value>
- <description>Classname of Tika HTMLMapper to use. Influences the elements included in the DOM and hence
- the behaviour of the HTMLParseFilters.
- </description>
- </property>
- -->
- <property>
- <name>tika.uppercase.element.names</name>
- <value>true</value>
- <description>Determines whether TikaParser should uppercase the element name while generating the DOM
- for a page, as done by Neko (used per default by parse-html)(see NUTCH-1592).
- </description>
- </property>
- <!-- urlfilter plugin properties -->
- <property>
- <name>urlfilter.domain.file</name>
- <value>domain-urlfilter.txt</value>
- <description>Name of file on CLASSPATH containing either top level domains or
- hostnames used by urlfilter-domain (DomainURLFilter) plugin.</description>
- </property>
- <property>
- <name>urlfilter.regex.file</name>
- <value>regex-urlfilter.txt</value>
- <description>Name of file on CLASSPATH containing regular expressions
- used by urlfilter-regex (RegexURLFilter) plugin.</description>
- </property>
- <property>
- <name>urlfilter.automaton.file</name>
- <value>automaton-urlfilter.txt</value>
- <description>Name of file on CLASSPATH containing regular expressions
- used by urlfilter-automaton (AutomatonURLFilter) plugin.</description>
- </property>
- <property>
- <name>urlfilter.prefix.file</name>
- <value>prefix-urlfilter.txt</value>
- <description>Name of file on CLASSPATH containing url prefixes
- used by urlfilter-prefix (PrefixURLFilter) plugin.</description>
- </property>
- <property>
- <name>urlfilter.suffix.file</name>
- <value>suffix-urlfilter.txt</value>
- <description>Name of file on CLASSPATH containing url suffixes
- used by urlfilter-suffix (SuffixURLFilter) plugin.</description>
- </property>
- <property>
- <name>urlfilter.order</name>
- <value></value>
- <description>The order by which url filters are applied.
- If empty, all available url filters (as dictated by properties
- plugin-includes and plugin-excludes above) are loaded and applied in system
- defined order. If not empty, only named filters are loaded and applied
- in given order. For example, if this property has value:
- org.apache.nutch.urlfilter.regex.RegexURLFilter org.apache.nutch.urlfilter.prefix.PrefixURLFilter
- then RegexURLFilter is applied first, and PrefixURLFilter second.
- Since all filters are AND'ed, filter ordering does not have impact
- on end result, but it may have performance implication, depending
- on relative expensiveness of filters.
- </description>
- </property>
- <!-- scoring filters properties -->
- <property>
- <name>scoring.filter.order</name>
- <value></value>
- <description>The order in which scoring filters are applied. This
- may be left empty (in which case all available scoring filters will
- be applied in system defined order), or a space separated list of
- implementation classes.
- </description>
- </property>
- <!-- scoring-depth properties
- Add 'scoring-depth' to the list of active plugins
- in the parameter 'plugin.includes' in order to use it.
- -->
- <property>
- <name>scoring.depth.max</name>
- <value>1000</value>
- <description>Max depth value from seed allowed by default.
- Can be overridden on a per-seed basis by specifying "_maxdepth_=VALUE"
- as a seed metadata. This plugin adds a "_depth_" metadatum to the pages
- to track the distance from the seed it was found from.
- The depth is used to prioritise URLs in the generation step so that
- shallower pages are fetched first.
- </description>
- </property>
- <!-- language-identifier plugin properties -->
- <property>
- <name>lang.analyze.max.length</name>
- <value>2048</value>
- <description> The maximum number of bytes of data used to identify
- the language (0 means full content analysis).
- The larger this value, the better the analysis, but the
- slower it is.
- This is language-related and is used at tokenization time.
- </description>
- </property>
- <property>
- <name>lang.extraction.policy</name>
- <value>detect,identify</value>
- <description>This determines when the plugin uses detection and
- statistical identification mechanisms. The order in which the
- detect and identify are written will determine the extraction
- policy. Default case (detect,identify) means the plugin will
- first try to extract language info from page headers and metadata,
- if this is not successful it will try using tika language
- identification. Possible values are:
- detect
- identify
- detect,identify
- identify,detect
- </description>
- </property>
- <property>
- <name>lang.identification.only.certain</name>
- <value>false</value>
- <description>If set to true with lang.extraction.policy containing identify,
- the language code returned by Tika will be assigned to the document ONLY
- if it is deemed certain by Tika.
- </description>
- </property>
- <!-- index-static plugin properties -->
- <property>
- <name>index.static</name>
- <value></value>
- <description>
- Used by plugin index-static to adds fields with static data at indexing time.
- You can specify a comma-separated list of fieldname:fieldcontent per Nutch job.
- Each fieldcontent can have multiple values separated by space, e.g.,
- field1:value1.1 value1.2 value1.3,field2:value2.1 value2.2 ...
- It can be useful when collections can't be created by URL patterns,
- like in subcollection, but on a job-basis.
- </description>
- </property>
- <!-- index-metadata plugin properties -->
- <property>
- <name>index.parse.md</name>
- <value>metatag.description,metatag.keywords</value>
- <description>
- Comma-separated list of keys to be taken from the parse metadata to generate fields.
- Can be used e.g. for 'description' or 'keywords' provided that these values are generated
- by a parser (see parse-metatags plugin)
- </description>
- </property>
- <property>
- <name>index.content.md</name>
- <value></value>
- <description>
- Comma-separated list of keys to be taken from the content metadata to generate fields.
- </description>
- </property>
- <property>
- <name>index.db.md</name>
- <value></value>
- <description>
- Comma-separated list of keys to be taken from the crawldb metadata to generate fields.
- Can be used to index values propagated from the seeds with the plugin urlmeta
- </description>
- </property>
- <!-- index-geoip plugin properties -->
- <property>
- <name>index.geoip.usage</name>
- <value>insightsService</value>
- <description>
- A string representing the information source to be used for GeoIP information
- association. Either enter 'cityDatabase', 'connectionTypeDatabase',
- 'domainDatabase', 'ispDatabase' or 'insightsService'. If you wish to use any one of the
- Database options, you should make one of GeoIP2-City.mmdb, GeoIP2-Connection-Type.mmdb,
- GeoIP2-Domain.mmdb or GeoIP2-ISP.mmdb files respectively available on the classpath and
- available at runtime.
- </description>
- </property>
- <property>
- <name>index.geoip.userid</name>
- <value></value>
- <description>
- The userId associated with the GeoIP2 Precision Services account.
- </description>
- </property>
- <property>
- <name>index.geoip.licensekey</name>
- <value></value>
- <description>
- The license key associated with the GeoIP2 Precision Services account.
- </description>
- </property>
- <!-- parse-metatags plugin properties -->
- <property>
- <name>metatags.names</name>
- <value>description,keywords</value>
- <description> Names of the metatags to extract, separated by ','.
- Use '*' to extract all metatags. Prefixes the names with 'metatag.'
- in the parse-metadata. For instance to index description and keywords,
- you need to activate the plugin index-metadata and set the value of the
- parameter 'index.parse.md' to 'metatag.description,metatag.keywords'.
- </description>
- </property>
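- <!-- Annotation: a sketch of wiring the two plugins together as the description explains.
- It assumes parse-metatags and index-metadata have been added to plugin.includes; the meta
- tags named here reach the index through the keys listed in index.parse.md.
- <property>
- <name>metatags.names</name>
- <value>description,keywords</value>
- </property>
- <property>
- <name>index.parse.md</name>
- <value>metatag.description,metatag.keywords</value>
- </property>
- -->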
- <!-- Temporary Hadoop 0.17.x workaround. -->
- <property>
- <name>hadoop.job.history.user.location</name>
- <value>${hadoop.log.dir}/history/user</value>
- <description>Hadoop 0.17.x comes with a default setting to create
- user logs inside the output path of the job. This breaks some
- Hadoop classes, which expect the output to contain only
- part-XXXXX files. This setting changes the output to a
- subdirectory of the regular log directory.
- </description>
- </property>
- <!-- linkrank scoring properties -->
- <property>
- <name>link.ignore.internal.host</name>
- <value>true</value>
- <description>Ignore outlinks to the same hostname.</description>
- </property>
- <property>
- <name>link.ignore.internal.domain</name>
- <value>true</value>
- <description>Ignore outlinks to the same domain.</description>
- </property>
- <property>
- <name>link.ignore.limit.page</name>
- <value>true</value>
- <description>Limit to only a single outlink to the same page.</description>
- </property>
- <property>
- <name>link.ignore.limit.domain</name>
- <value>true</value>
- <description>Limit to only a single outlink to the same domain.</description>
- </property>
- <property>
- <name>link.analyze.num.iterations</name>
- <value>10</value>
- <description>The number of LinkRank iterations to run.</description>
- </property>
- <property>
- <name>link.analyze.initial.score</name>
- <value>1.0f</value>
- <description>The initial score.</description>
- </property>
- <property>
- <name>link.analyze.damping.factor</name>
- <value>0.85f</value>
- <description>The damping factor.</description>
- </property>
- <property>
- <name>link.delete.gone</name>
- <value>false</value>
- <description>Whether to delete gone pages from the web graph.</description>
- </property>
- <property>
- <name>link.loops.depth</name>
- <value>2</value>
- <description>The depth for the loops algorithm.</description>
- </property>
- <property>
- <name>link.score.updater.clear.score</name>
- <value>0.0f</value>
- <description>The default score for URL's that are not in the web graph.</description>
- </property>
- <property>
- <name>mapreduce.fileoutputcommitter.marksuccessfuljobs</name>
- <value>false</value>
- <description>Hadoop >= 0.21 generates SUCCESS files in the output which can crash
- the readers. This should not be an issue once Nutch is ported to the new MapReduce API
- but for now this parameter should prevent such cases.
- </description>
- </property>
- <!-- solr index properties -->
- <property>
- <name>solr.server.url</name>
- <value>http://127.0.0.1:8983/solr/</value>
- <description>
- Defines the Solr URL into which data should be indexed using the
- indexer-solr plugin.
- </description>
- </property>
- <property>
- <name>solr.mapping.file</name>
- <value>solrindex-mapping.xml</value>
- <description>
- Defines the name of the file that will be used in the mapping of internal
- nutch field names to solr index fields as specified in the target Solr schema.
- </description>
- </property>
- <property>
- <name>solr.commit.size</name>
- <value>250</value>
- <description>
- Defines the number of documents to send to Solr in a single update batch.
- Decrease when handling very large documents to prevent Nutch from running
- out of memory. NOTE: It does not explicitly trigger a server side commit.
- </description>
- </property>
- <property>
- <name>solr.commit.index</name>
- <value>true</value>
- <description>
- When closing the indexer, trigger a commit to the Solr server.
- </description>
- </property>
- <property>
- <name>solr.auth</name>
- <value>false</value>
- <description>
- Whether to enable HTTP basic authentication for communicating with Solr.
- Use the solr.auth.username and solr.auth.password properties to configure
- your credentials.
- </description>
- </property>
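- <!-- Illustrative sketch (not part of the original file): a minimal
- nutch-site.xml override for indexing into a local Solr core, using only
- properties documented above; the URL and credentials are placeholders.
- <property>
- <name>solr.server.url</name>
- <value>http://127.0.0.1:8983/solr/nutch</value>
- </property>
- <property>
- <name>solr.commit.size</name>
- <value>100</value>
- </property>
- <property>
- <name>solr.auth</name>
- <value>true</value>
- </property>
- <property>
- <name>solr.auth.username</name>
- <value>solr_user</value>
- </property>
- <property>
- <name>solr.auth.password</name>
- <value>solr_password</value>
- </property>
- -->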
- <!-- Elasticsearch properties -->
- <property>
- <name>elastic.host</name>
- <value></value>
- <description>The hostname to send documents to using TransportClient. Either host
- and port must be defined or cluster.</description>
- </property>
- <property>
- <name>elastic.port</name>
- <value>9300</value>
- <description>The port to connect to using TransportClient.</description>
- </property>
- <property>
- <name>elastic.cluster</name>
- <value></value>
- <description>The cluster name to discover. Either host and port must be defined
- or cluster.</description>
- </property>
- <property>
- <name>elastic.index</name>
- <value>nutch</value>
- <description>Default index to send documents to.</description>
- </property>
- <property>
- <name>elastic.max.bulk.docs</name>
- <value>250</value>
- <description>Maximum size of the bulk in number of documents.</description>
- </property>
- <property>
- <name>elastic.max.bulk.size</name>
- <value>2500500</value>
- <description>Maximum size of the bulk in bytes.</description>
- </property>
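- <!-- Illustrative sketch (not part of the original file): per the
- descriptions above, either elastic.host and elastic.port or elastic.cluster
- must be set; a nutch-site.xml override could look roughly like this, with
- placeholder values.
- <property>
- <name>elastic.host</name>
- <value>localhost</value>
- </property>
- <property>
- <name>elastic.port</name>
- <value>9300</value>
- </property>
- <property>
- <name>elastic.index</name>
- <value>nutch</value>
- </property>
- -->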
- <!-- subcollection properties -->
- <property>
- <name>subcollection.default.fieldname</name>
- <value>subcollection</value>
- <description>
- The default field name for the subcollections.
- </description>
- </property>
- <!-- Headings plugin properties -->
- <property>
- <name>headings</name>
- <value>h1,h2</value>
- <description>Comma separated list of headings to retrieve from the document</description>
- </property>
- <property>
- <name>headings.multivalued</name>
- <value>false</value>
- <description>Whether to support multivalued headings.</description>
- </property>
- <!-- mimetype-filter plugin properties -->
- <property>
- <name>mimetype.filter.file</name>
- <value>mimetype-filter.txt</value>
- <description>
- The configuration file for the mimetype-filter plugin. This file contains
- the rules used to allow or deny the indexing of certain documents.
- </description>
- </property>
- </configuration>
regex-urlfilter解释.txt
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# (A '-' prefix means the URL is excluded, a '+' prefix means it is included.)

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
# The default rule below filters out any URL containing the listed characters,
# so URLs with ?, *, !, @ or = would never be fetched. For a forum crawl it is
# recommended to relax this to -[~]:
#-[?*!@=]
-[~]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
#+.
# Whitelist regular expression: ([a-z0-9]*\.)* matches any sequence of
# letter/digit subdomain labels, and [\s\S]* matches any characters.
+^http://([a-z0-9]*\.)*bbs\.superwu\.cn/[\s\S]*

# Rules for crawling data from a Discuz forum:
+^http://bbs\.superwu\.cn/forum\.php$
+^http://bbs\.superwu\.cn/forum\.php\?mod=forumdisplay&fid=\d+$
+^http://bbs\.superwu\.cn/forum\.php\?mod=forumdisplay&fid=\d+&page=\d+$
+^http://bbs\.superwu\.cn/forum\.php\?mod=viewthread&tid=\d+&extra=page%3D\d+$
+^http://bbs\.superwu\.cn/forum\.php\?mod=viewthread&tid=\d+&extra=page%3D\d+&page=\d+$
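
# Illustration (not part of the original file): with the rules above and
# first-match-wins evaluation,
#   http://bbs.superwu.cn/forum.php?mod=viewthread&tid=123&extra=page%3D1
#     is accepted (its first match is the general bbs.superwu.cn '+' rule),
#   http://bbs.superwu.cn/logo.jpg
#     is rejected by the image/suffix '-' rule, and
#   http://www.example.com/
#     matches no pattern at all and is therefore ignored.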