nutch-default.xml explained

  1. <?xml version="1.0"?>
  2. <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
  3. <!--
  4. Licensed to the Apache Software Foundation (ASF) under one or more
  5. contributor license agreements. See the NOTICE file distributed with
  6. this work for additional information regarding copyright ownership.
  7. The ASF licenses this file to You under the Apache License, Version 2.0
  8. (the "License"); you may not use this file except in compliance with
  9. the License. You may obtain a copy of the License at
  10.  
  11. http://www.apache.org/licenses/LICENSE-2.0
  12.  
  13. Unless required by applicable law or agreed to in writing, software
  14. distributed under the License is distributed on an "AS IS" BASIS,
  15. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  16. See the License for the specific language governing permissions and
  17. limitations under the License.
  18. -->
  19. <!-- Do not modify this file directly. Instead, copy entries that you -->
  20. <!-- wish to modify from this file into nutch-site.xml and change them -->
  21. <!-- there. If nutch-site.xml does not already exist, create it. -->
  22.  
  23. <configuration>
  24.  
  25. <!-- general properties -->
  26.  
  27. <property>
  28. <name>store.ip.address</name>
  29. <value>false</value>
  30. <description>Enables us to capture the specific IP address
  31. (InetSocketAddress) of the host which we connect to via
  32. the given protocol. Currently supported is protocol-ftp and
  33. http.
  34. </description>
  35. </property>
  36.  
  37. <!-- file properties -->
  38.  
  39. <property>
  40. <name>file.content.limit</name>
  41. <value>65536</value>
  42. <description>The length limit for downloaded content using the file://
  43. protocol, in bytes. If this value is nonnegative (>=0), content longer
  44. than it will be truncated; otherwise, no truncation at all. Do not
  45. confuse this setting with the http.content.limit setting.
  46. By default, Nutch only fetches the first 65536 bytes of a page and discards the rest.
  47. For some large sites, however, the homepage is far bigger than 65536 bytes,
  48. and the first 65536 bytes may even be nothing but layout markup without a single hyperlink,
  49. so you can set this to a much larger value, or to -1 for no limit at all.
  50. </description>
  51. </property>
  52.  
  53. <property>
  54. <name>file.crawl.parent</name>
  55. <value>true</value>
  56. <description>By default the crawler is not restricted to the directories that you specified in the
  57. URLs file; it also climbs into the parent directories. For your own crawls you can
  58. change this behavior (set it to false) so that only directories beneath the directories that you specify get
  59. crawled.</description>
  60. </property>
  61.  
  62. <property>
  63. <name>file.crawl.redirect_noncanonical</name>
  64. <value>true</value>
  65. <description>
  66. If true, protocol-file treats non-canonical file names as
  67. redirects and does not canonicalize file names internally. A file
  68. name containing symbolic links as path elements is then not
  69. resolved and &quot;fetched&quot; but recorded as redirect with the
  70. canonical name (all links on path are resolved) as redirect
  71. target.
  72. </description>
  73. </property>
  74.  
  75. <property>
  76. <name>file.content.ignored</name>
  77. <value>true</value>
  78. <description>If true, no file content will be saved during fetch.
  79. This is probably what we want most of the time, since file:// URLs
  80. are meant to be local and we can always use them directly at the parsing
  81. and indexing stages. Otherwise file contents will be saved.
  82. !! NOT IMPLEMENTED YET !!
  83. </description>
  84. </property>
  85.  
  86. <!-- HTTP properties -->
  87.  
  88. <property>
  89. <name>http.agent.name</name>
  90. <value></value>
  91. <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  92. please set this to a single word uniquely related to your organization.
  93.  
  94. NOTE: You should also check other related properties:
  95.  
  96. http.robots.agents
  97. http.agent.description
  98. http.agent.url
  99. http.agent.email
  100. http.agent.version
  101.  
  102. and set their values appropriately.
  103.  
  104. </description>
  105. </property>
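Because http.agent.name must not be left empty, a crawl usually starts by overriding it (and ideally the related agent properties) in nutch-site.xml. A sketch with placeholder values only:

<property>
  <name>http.agent.name</name>
  <value>MyTestSpider</value>
</property>
<property>
  <name>http.agent.description</name>
  <value>test crawler for internal experiments</value>
</property>
<property>
  <name>http.agent.url</name>
  <value>http://www.example.com/bot.html</value>
</property>
<property>
  <name>http.agent.email</name>
  <value>crawler at example dot com</value>
</property>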
  106.  
  107. <property>
  108. <name>http.robots.agents</name>
  109. <value></value>
  110. <description>Any other agents, apart from 'http.agent.name', that the robots
  111. parser would look for in robots.txt. Multiple agents can be provided using
  112. comma as a delimiter. eg. mybot,foo-spider,bar-crawler
  113.  
  114. The ordering of agents does NOT matter and the robots parser would make
  115. decision based on the agent which matches first to the robots rules.
  116. Also, there is NO need to add a wildcard (ie. "*") to this string as the
  117. robots parser would smartly take care of a no-match situation.
  118.  
  119. If no value is specified, by default HTTP agent (ie. 'http.agent.name')
  120. would be used for user agent matching by the robots parser.
  121. </description>
  122. </property>
  123.  
  124. <property>
  125. <name>http.robot.rules.whitelist</name>
  126. <value></value>
  127. <description>Comma separated list of hostnames or IP addresses to ignore
  128. robot rules parsing for. Use with care and only if you are explicitly
  129. allowed by the site owner to ignore the site's robots.txt!
  130. </description>
  131. </property>
  132.  
  133. <property>
  134. <name>http.robots.403.allow</name>
  135. <value>true</value>
  136. <description>Some servers return HTTP status 403 (Forbidden) if
  137. /robots.txt doesn't exist. This should probably mean that we are
  138. allowed to crawl the site nonetheless. If this is set to false,
  139. then such sites will be treated as forbidden.</description>
  140. </property>
  141.  
  142. <property>
  143. <name>http.agent.description</name>
  144. <value></value>
  145. <description>Further description of our bot - this text is used in
  146. the User-Agent header. It appears in parentheses after the agent name.
  147. </description>
  148. </property>
  149.  
  150. <property>
  151. <name>http.agent.url</name>
  152. <value></value>
  153. <description>A URL to advertise in the User-Agent header. This will
  154. appear in parentheses after the agent name. Custom dictates that this
  155. should be a URL of a page explaining the purpose and behavior of this
  156. crawler.
  157. </description>
  158. </property>
  159.  
  160. <property>
  161. <name>http.agent.email</name>
  162. <value></value>
  163. <description>An email address to advertise in the HTTP 'From' request
  164. header and User-Agent header. A good practice is to mangle this
  165. address (e.g. 'info at example dot com') to avoid spamming.
  166. </description>
  167. </property>
  168.  
  169. <property>
  170. <name>http.agent.version</name>
  171. <value>Nutch-1.10</value>
  172. <description>A version string to advertise in the User-Agent
  173. header.</description>
  174. </property>
  175.  
  176. <property>
  177. <name>http.agent.rotate</name>
  178. <value>false</value>
  179. <description>
  180. If true, instead of http.agent.name, alternating agent names are
  181. chosen from a list provided via http.agent.rotate.file.
  182. </description>
  183. </property>
  184.  
  185. <property>
  186. <name>http.agent.rotate.file</name>
  187. <value>agents.txt</value>
  188. <description>
  189. File containing alternative user agent names to be used instead of
  190. http.agent.name on a rotating basis if http.agent.rotate is true.
  191. Each line of the file should contain exactly one agent
  192. specification including name, version, description, URL, etc.
  193. </description>
  194. </property>
  195.  
  196. <property>
  197. <name>http.agent.host</name>
  198. <value></value>
  199. <description>Name or IP address of the host on which the Nutch crawler
  200. would be running. Currently this is used by 'protocol-httpclient'
  201. plugin.
  202. </description>
  203. </property>
  204.  
  205. <property>
  206. <name>http.timeout</name>
  207. <value>10000</value>
  208. <description>The default network timeout, in milliseconds.</description>
  209. </property>
  210.  
  211. <property>
  212. <name>http.max.delays</name>
  213. <value>100</value>
  214. <description>The number of times a thread will delay when trying to
  215. fetch a page. Each time it finds that a host is busy, it will wait
  216. fetcher.server.delay. After http.max.delays attempts, it will give
  217. up on the page for now.
  218. This governs how long the crawler's threads wait when a host is busy, and it depends on network conditions.
  219. If, while the crawler is running, the server reports the host as busy, the wait time per attempt is determined by fetcher.server.delay,
  220. so when network conditions are poor it is advisable to give fetcher.server.delay a somewhat larger value as well;
  221. http.timeout is likewise related to network conditions.
  222. </description>
  223. </property>
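A rough worked example of how these values interact (assuming a single fetcher thread per queue, so fetcher.server.delay applies): with the defaults, a thread that keeps finding a host busy waits fetcher.server.delay (5.0 seconds, see the fetcher properties below) up to http.max.delays = 100 times, i.e. about 100 x 5 s = 500 s, before giving up on the page for now.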
  224.  
  225. <property>
  226. <name>http.content.limit</name>
  227. <value>65536</value>
  228. <description>The length limit for downloaded content using the http://
  229. protocol, in bytes. If this value is nonnegative (>=0), content longer
  230. than it will be truncated; otherwise, no truncation at all. Do not
  231. confuse this setting with the file.content.limit setting.
  232. This configures how many bytes of each fetched document the crawler keeps. The default is 65536,
  233. i.e. roughly the first 64 KB of a document are kept and anything beyond that is discarded.
  234. Search engines that need specific content in full, such as XML documents, should change this.
  235. Set it to -1 for no limit.
  236. </description>
  237. </property>
  238.  
  239. <!-- The following four properties set the proxy host and port; if the proxy requires a password, also set the username and password. -->
  240. <property>
  241. <name>http.proxy.host</name>
  242. <value></value>
  243. <description>The proxy hostname. If empty, no proxy is used.</description>
  244. </property>
  245.  
  246. <property>
  247. <name>http.proxy.port</name>
  248. <value></value>
  249. <description>The proxy port.</description>
  250. </property>
  251.  
  252. <property>
  253. <name>http.proxy.username</name>
  254. <value></value>
  255. <description>Username for proxy. This will be used by
  256. 'protocol-httpclient', if the proxy server requests basic, digest
  257. and/or NTLM authentication. To use this, 'protocol-httpclient' must
  258. be present in the value of 'plugin.includes' property.
  259. NOTE: For NTLM authentication, do not prefix the username with the
  260. domain, i.e. 'susam' is correct whereas 'DOMAIN\susam' is incorrect.
  261. </description>
  262. </property>
  263.  
  264. <property>
  265. <name>http.proxy.password</name>
  266. <value></value>
  267. <description>Password for proxy. This will be used by
  268. 'protocol-httpclient', if the proxy server requests basic, digest
  269. and/or NTLM authentication. To use this, 'protocol-httpclient' must
  270. be present in the value of 'plugin.includes' property.
  271. </description>
  272. </property>
  273.  
  274. <property>
  275. <name>http.proxy.realm</name>
  276. <value></value>
  277. <description>Authentication realm for proxy. Do not define a value
  278. if realm is not required or authentication should take place for any
  279. realm. NTLM does not use the notion of realms. Specify the domain name
  280. of NTLM authentication as the value for this property. To use this,
  281. 'protocol-httpclient' must be present in the value of
  282. 'plugin.includes' property.
  283. </description>
  284. </property>
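Putting the proxy properties together, a nutch-site.xml fragment for a hypothetical authenticated proxy might look like the sketch below. Host, port, and credentials are placeholders, and as noted above 'protocol-httpclient' must be listed in plugin.includes for the username/password to be used.

<property>
  <name>http.proxy.host</name>
  <value>proxy.example.com</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>8080</value>
</property>
<property>
  <name>http.proxy.username</name>
  <value>crawluser</value>
</property>
<property>
  <name>http.proxy.password</name>
  <value>secret</value>
</property>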
  285.  
  286. <property>
  287. <name>http.auth.file</name>
  288. <value>httpclient-auth.xml</value>
  289. <description>Authentication configuration file for
  290. 'protocol-httpclient' plugin.
  291. </description>
  292. </property>
  293.  
  294. <property>
  295. <name>http.verbose</name>
  296. <value>false</value>
  297. <description>If true, HTTP will log more verbosely.</description>
  298. </property>
  299.  
  300. <property>
  301. <name>http.redirect.max</name>
  302. <value>0</value>
  303. <description>The maximum number of redirects the fetcher will follow when
  304. trying to fetch a page. If set to negative or 0, fetcher won't immediately
  305. follow redirected URLs, instead it will record them for later fetching.
  306. </description>
  307. </property>
  308.  
  309. <property>
  310. <name>http.useHttp11</name>
  311. <value>false</value>
  312. <description>NOTE: at the moment this works only for protocol-httpclient.
  313. If true, use HTTP 1.1, if false use HTTP 1.0 .
  314. </description>
  315. </property>
  316.  
  317. <property>
  318. <name>http.accept.language</name>
  319. <value>en-us,en-gb,en;q=0.7,*;q=0.3</value>
  320. <description>Value of the "Accept-Language" request header field.
  321. This allows selecting non-English language as default one to retrieve.
  322. It is a useful setting for search engines built for a certain national or language group.
  323. </description>
  324. </property>
  325.  
  326. <property>
  327. <name>http.accept</name>
  328. <value>text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8</value>
  329. <description>Value of the "Accept" request header field.
  330. </description>
  331. </property>
  332.  
  333. <property>
  334. <name>http.store.responsetime</name>
  335. <value>true</value>
  336. <description>Enables us to record the response time of the
  337. host, i.e. the time period between opening and closing the
  338. connection to a page's host. The response time in milliseconds
  339. is stored in CrawlDb in CrawlDatum's meta data under key &quot;_rs_&quot;
  340. </description>
  341. </property>
  342.  
  343. <property>
  344. <name>http.enable.if.modified.since.header</name>
  345. <value>true</value>
  346. <description>Whether Nutch sends an HTTP If-Modified-Since header. It reduces
  347. bandwidth when enabled by not downloading pages that respond with an HTTP
  348. Not-Modified header. URLs that are not downloaded are not passed through
  349. parse or indexing filters. If you regularly modify filters, you should force
  350. Nutch to also download unmodified pages by disabling this feature.
  351. </description>
  352. </property>
  353.  
  354. <!-- FTP properties -->
  355.  
  356. <property>
  357. <name>ftp.username</name>
  358. <value>anonymous</value>
  359. <description>ftp login username.</description>
  360. </property>
  361.  
  362. <property>
  363. <name>ftp.password</name>
  364. <value>anonymous@example.com</value>
  365. <description>ftp login password.</description>
  366. </property>
  367.  
  368. <property>
  369. <name>ftp.content.limit</name>
  370. <value>65536</value>
  371. <description>The length limit for downloaded content, in bytes.
  372. If this value is nonnegative (>=0), content longer than it will be truncated;
  373. otherwise, no truncation at all.
  374. Caution: classical ftp RFCs never define partial transfer and, in fact,
  375. some ftp servers out there do not handle client side forced close-down very
  376. well. Our implementation tries its best to handle such situations smoothly.
  377. By default, Nutch only fetches the first 65536 bytes and discards the rest.
  378. For some large sites, however, the content is far bigger than 65536 bytes,
  379. and the first 65536 bytes may even contain only layout information with no hyperlinks at all.
  380. Set it to -1 for no limit.
  381. </description>
  382. </property>
  383.  
  384. <property>
  385. <name>ftp.timeout</name>
  386. <value>60000</value>
  387. <description>Default timeout for ftp client socket, in millisec.
  388. Please also see ftp.keep.connection below.</description>
  389. </property>
  390.  
  391. <property>
  392. <name>ftp.server.timeout</name>
  393. <value>100000</value>
  394. <description>An estimation of ftp server idle time, in millisec.
  395. Typically it is 120000 millisec for many ftp servers out there.
  396. Better be conservative here. Together with ftp.timeout, it is used to
  397. decide if we need to delete (annihilate) current ftp.client instance and
  398. force to start another ftp.client instance anew. This is necessary because
  399. a fetcher thread may not be able to obtain next request from queue in time
  400. (due to idleness) before our ftp client times out or remote server
  401. disconnects. Used only when ftp.keep.connection is true (please see below).
  402. </description>
  403. </property>
  404.  
  405. <property>
  406. <name>ftp.keep.connection</name>
  407. <value>false</value>
  408. <description>Whether to keep ftp connection. Useful if crawling same host
  409. again and again. When set to true, it avoids connection, login and dir list
  410. parser setup for subsequent urls. If it is set to true, however, you must
  411. make sure (roughly):
  412. (1) ftp.timeout is less than ftp.server.timeout
  413. (2) ftp.timeout is larger than (fetcher.threads.fetch * fetcher.server.delay)
  414. Otherwise there will be too many "delete client because idled too long"
  415. messages in thread logs.</description>
  416. </property>
  417.  
  418. <property>
  419. <name>ftp.follow.talk</name>
  420. <value>false</value>
  421. <description>Whether to log dialogue between our client and remote
  422. server. Useful for debugging.</description>
  423. </property>
  424.  
  425. <!-- web db properties -->
  426. <property>
  427. <name>db.fetch.interval.default</name>
  428. <value>2592000</value>
  429. <description>The default number of seconds between re-fetches of a page (30 days).
  430. This is useful for periodic, automated re-crawls: it sets how long to wait before a page is re-fetched; the default is 2592000 seconds, i.e. 30 days.
  431. </description>
  432. </property>
  433.  
  434. <property>
  435. <name>db.fetch.interval.max</name>
  436. <value>7776000</value>
  437. <description>The maximum number of seconds between re-fetches of a page
  438. (90 days). After this period every page in the db will be re-tried, no
  439. matter what is its status.
  440. </description>
  441. </property>
  442.  
  443. <property>
  444. <name>db.fetch.schedule.class</name>
  445. <value>org.apache.nutch.crawl.DefaultFetchSchedule</value>
  446. <description>The implementation of fetch schedule. DefaultFetchSchedule simply
  447. adds the original fetchInterval to the last fetch time, regardless of
  448. page changes.</description>
  449. </property>
  450.  
  451. <property>
  452. <name>db.fetch.schedule.adaptive.inc_rate</name>
  453. <value>0.4</value>
  454. <description>If a page is unmodified, its fetchInterval will be
  455. increased by this rate. This value should not
  456. exceed 0.5, otherwise the algorithm becomes unstable.</description>
  457. </property>
  458.  
  459. <property>
  460. <name>db.fetch.schedule.adaptive.dec_rate</name>
  461. <value>0.2</value>
  462. <description>If a page is modified, its fetchInterval will be
  463. decreased by this rate. This value should not
  464. exceed 0.5, otherwise the algorithm becomes unstable.</description>
  465. </property>
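A rough worked example of the two rates, assuming db.fetch.schedule.class is switched to the AdaptiveFetchSchedule (the default above is DefaultFetchSchedule, which ignores them): a page with a current fetchInterval of 10 days that is found unmodified has its interval increased to roughly 10 x (1 + 0.4) = 14 days, while a page that did change has it decreased to roughly 10 x (1 - 0.2) = 8 days, always clamped by the min_interval and max_interval properties below.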
  466.  
  467. <property>
  468. <name>db.fetch.schedule.adaptive.min_interval</name>
  469. <value>60.0</value>
  470. <description>Minimum fetchInterval, in seconds.</description>
  471. </property>
  472.  
  473. <property>
  474. <name>db.fetch.schedule.adaptive.max_interval</name>
  475. <value>31536000.0</value>
  476. <description>Maximum fetchInterval, in seconds (365 days).
  477. NOTE: this is limited by db.fetch.interval.max. Pages with
  478. fetchInterval larger than db.fetch.interval.max
  479. will be fetched anyway.</description>
  480. </property>
  481.  
  482. <property>
  483. <name>db.fetch.schedule.adaptive.sync_delta</name>
  484. <value>true</value>
  485. <description>If true, try to synchronize with the time of page change
  486. by shifting the next fetchTime by a fraction (sync_delta_rate) of the difference
  487. between the last modification time and the last fetch time.</description>
  488. </property>
  489.  
  490. <property>
  491. <name>db.fetch.schedule.adaptive.sync_delta_rate</name>
  492. <value>0.3</value>
  493. <description>See sync_delta for description. This value should not
  494. exceed 0.5, otherwise the algorithm becomes unstable.</description>
  495. </property>
  496.  
  497. <property>
  498. <name>db.fetch.schedule.mime.file</name>
  499. <value>adaptive-mimetypes.txt</value>
  500. <description>The configuration file for the MimeAdaptiveFetchSchedule.
  501. </description>
  502. </property>
  503.  
  504. <property>
  505. <name>db.update.additions.allowed</name>
  506. <value>true</value>
  507. <description>If true, updatedb will add newly discovered URLs, if false
  508. only already existing URLs in the CrawlDb will be updated and no new
  509. URLs will be added.
  510. </description>
  511. </property>
  512.  
  513. <property>
  514. <name>db.preserve.backup</name>
  515. <value>true</value>
  516. <description>If true, updatedb will keep a backup of the previous CrawlDB
  517. version in the old directory. In case of disaster, one can rename old to
  518. current and restore the CrawlDB to its previous state.
  519. </description>
  520. </property>
  521.  
  522. <property>
  523. <name>db.update.purge.404</name>
  524. <value>false</value>
  525. <description>If true, updatedb will purge records with status DB_GONE
  526. from the CrawlDB.
  527. </description>
  528. </property>
  529.  
  530. <property>
  531. <name>db.url.normalizers</name>
  532. <value>false</value>
  533. <description>Normalize urls when updating crawldb</description>
  534. </property>
  535.  
  536. <property>
  537. <name>db.url.filters</name>
  538. <value>false</value>
  539. <description>Filter urls when updating crawldb</description>
  540. </property>
  541.  
  542. <property>
  543. <name>db.update.max.inlinks</name>
  544. <value>10000</value>
  545. <description>Maximum number of inlinks to take into account when updating
  546. a URL score in the crawlDB. Only the best scoring inlinks are kept.
  547. </description>
  548. </property>
  549.  
  550. <property>
  551. <name>db.ignore.internal.links</name>
  552. <value>true</value>
  553. <description>If true, when adding new links to a page, links from
  554. the same host are ignored. This is an effective way to limit the
  555. size of the link database, keeping only the highest quality
  556. links.
  557. </description>
  558. </property>
  559.  
  560. <property>
  561. <name>db.ignore.external.links</name>
  562. <value>false</value>
  563. <description>If true, outlinks leading from a page to external hosts
  564. will be ignored. This is an effective way to limit the crawl to include
  565. only initially injected hosts, without creating complex URLFilters.
  566. If true, only pages on the same host as the seed are crawled and external links are ignored.
  567. The same effect can be achieved by adding filters to regex-urlfilter.txt,
  568. but with very many filter rules (e.g. thousands) Nutch's performance degrades considerably.
  569. </description>
  570. </property>
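A common "stay on the seed hosts" setup therefore needs only this one switch in nutch-site.xml instead of thousands of regex filter rules; a minimal sketch:

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>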
  571.  
  572. <property>
  573. <name>db.injector.overwrite</name>
  574. <value>false</value>
  575. <description>Whether existing records in the CrawlDB will be overwritten
  576. by injected records.
  577. </description>
  578. </property>
  579.  
  580. <property>
  581. <name>db.injector.update</name>
  582. <value>false</value>
  583. <description>If true existing records in the CrawlDB will be updated with
  584. injected records. Old meta data is preserved. The db.injector.overwrite
  585. parameter has precedence.
  586. </description>
  587. </property>
  588.  
  589. <property>
  590. <name>db.score.injected</name>
  591. <value>1.0</value>
  592. <description>The score of new pages added by the injector.
  593. The default page score (importance) assigned to a URL when it is injected.
  594. </description>
  595. </property>
  596.  
  597. <property>
  598. <name>db.score.link.external</name>
  599. <value>1.0</value>
  600. <description>The score factor for new pages added due to a link from
  601. another host relative to the referencing page's score. Scoring plugins
  602. may use this value to affect initial scores of external links.
  603. </description>
  604. </property>
  605.  
  606. <property>
  607. <name>db.score.link.internal</name>
  608. <value>1.0</value>
  609. <description>The score factor for pages added due to a link from the
  610. same host, relative to the referencing page's score. Scoring plugins
  611. may use this value to affect initial scores of internal links.
  612. </description>
  613. </property>
  614.  
  615. <property>
  616. <name>db.score.count.filtered</name>
  617. <value>false</value>
  618. <description>The score value passed to newly discovered pages is
  619. calculated as a fraction of the original page score divided by the
  620. number of outlinks. If this option is false, only the outlinks that passed
  621. URLFilters will count, if it's true then all outlinks will count.
  622. </description>
  623. </property>
  624.  
  625. <property>
  626. <name>db.max.inlinks</name>
  627. <value>10000</value>
  628. <description>Maximum number of Inlinks per URL to be kept in LinkDb.
  629. If "invertlinks" finds more inlinks than this number, only the first
  630. N inlinks will be stored, and the rest will be discarded.
  631. </description>
  632. </property>
  633.  
  634. <property>
  635. <name>db.max.outlinks.per.page</name>
  636. <value>100</value>
  637. <description>The maximum number of outlinks that we'll process for a page.
  638. If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  639. will be processed for a page; otherwise, all outlinks will be processed.
  640. By default, Nutch processes only 100 outlinks of any given page, so some links are never fetched.
  641. To change this, increase the value or set it to -1, which means no limit.
  642. </description>
  643. </property>
  644.  
  645. <property>
  646. <name>db.max.anchor.length</name>
  647. <value>100</value>
  648. <description>The maximum number of characters permitted in an anchor.
  649. </description>
  650. </property>
  651.  
  652. <property>
  653. <name>db.parsemeta.to.crawldb</name>
  654. <value></value>
  655. <description>Comma-separated list of parse metadata keys to transfer to the crawldb (NUTCH-779).
  656. Assuming for instance that the languageidentifier plugin is enabled, setting the value to 'lang'
  657. will copy both the key 'lang' and its value to the corresponding entry in the crawldb.
  658. </description>
  659. </property>
  660.  
  661. <property>
  662. <name>db.fetch.retry.max</name>
  663. <value>3</value>
  664. <description>The maximum number of times a url that has encountered
  665. recoverable errors is generated for fetch.</description>
  666. </property>
  667.  
  668. <property>
  669. <name>db.signature.class</name>
  670. <value>org.apache.nutch.crawl.MD5Signature</value>
  671. <description>The default implementation of a page signature. Signatures
  672. created with this implementation will be used for duplicate detection
  673. and removal.</description>
  674. </property>
  675.  
  676. <property>
  677. <name>db.signature.text_profile.min_token_len</name>
  678. <value>2</value>
  679. <description>Minimum token length to be included in the signature.
  680. </description>
  681. </property>
  682.  
  683. <property>
  684. <name>db.signature.text_profile.quant_rate</name>
  685. <value>0.01</value>
  686. <description>Profile frequencies will be rounded down to a multiple of
  687. QUANT = (int)(QUANT_RATE * maxFreq), where maxFreq is a maximum token
  688. frequency. If maxFreq > 1 then QUANT will be at least 2, which means that
  689. for longer texts tokens with frequency 1 will always be discarded.
  690. </description>
  691. </property>
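A small worked example of the formula above (relevant only when db.signature.class is set to org.apache.nutch.crawl.TextProfileSignature rather than the MD5Signature default): if the most frequent token of a document occurs maxFreq = 300 times, then QUANT = (int)(0.01 * 300) = 3, so all token frequencies are rounded down to multiples of 3, and tokens with frequency 1 or 2 round down to 0 and are effectively discarded from the profile.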
  692.  
  693. <!-- generate properties -->
  694.  
  695. <property>
  696. <name>generate.max.count</name>
  697. <value>-1</value>
  698. <description>The maximum number of urls in a single
  699. fetchlist. -1 if unlimited. The urls are counted according
  700. to the value of the parameter generate.count.mode.
  701.  
  702. Used together with generate.count.mode, this limits the number of URLs belonging to
  703. the same host/domain/IP in any single generated fetchlist to at most (generate.max.count - 1).
  704.  
  705. -1 means there is no limit on how many URLs of the same host/domain/IP a fetchlist may contain.
  706. </description>
  707. </property>
  708.  
  709. <property>
  710. <name>generate.count.mode</name>
  711. <value>host</value>
  712. <description>Determines how the URLs are counted for generate.max.count.
  713. Default value is 'host' but can be 'domain'. Note that we do not count
  714. per IP in the new version of the Generator.
  715.  
  716. One of host/domain/IP: it selects the unit by which URLs are counted
  717. against the limit given by generate.max.count.
  718.  
  719. With 'host', URLs in each fetchlist are counted per host:
  720. a segment may not contain more than generate.max.count URLs of the same host;
  721. any further URLs of that host are placed in another fetchlist (if one still has room).
  722. </description>
  723. </property>
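For example, to cap each generated fetchlist at roughly 100 URLs per domain, one could combine the two properties above in nutch-site.xml; the numbers here are purely illustrative:

<property>
  <name>generate.max.count</name>
  <value>100</value>
</property>
<property>
  <name>generate.count.mode</name>
  <value>domain</value>
</property>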
  724.  
  725. <property>
  726. <name>generate.update.crawldb</name>
  727. <value>false</value>
  728. <description>For highly-concurrent environments, where several
  729. generate/fetch/update cycles may overlap, setting this to true ensures
  730. that generate will create different fetchlists even without intervening
  731. updatedb-s, at the cost of running an additional job to update CrawlDB.
  732. If false, running generate twice without intervening
  733. updatedb will generate identical fetchlists.
  734.  
  735. Whether to update the CrawlDb after the generator finishes, mainly by setting the _ngt_ field
  736. of each selected CrawlDatum to the time of this generator run. This prevents the next generator run
  737. (another run started within the period given by crawl.gen.delay)
  738. from selecting the same URLs again. (Note: even if a later run did select the same URLs,
  739. this would not cause a logic error; it would only waste resources by re-fetching the same URLs.)
  740. </description>
  741. </property>
  742.  
  743. <property>
  744. <name>generate.min.score</name>
  745. <value>0</value>
  746. <description>Select only entries with a score larger than
  747. generate.min.score.
  748.  
  749. If, after the ScoreFilters have run, a URL's score (a measure of page importance, similar to PageRank)
  750. is still below generate.min.score, the URL is skipped and not added to the fetchlist.
  751. Setting this value means the generator only considers the more important pages for the fetchlist.
  752.  
  753. With 0, no URL is filtered out by its score during the generate step.
  754. </description>
  755. </property>
  756.  
  757. <property>
  758. <name>generate.min.interval</name>
  759. <value>-1</value>
  760. <description>Select only entries with a retry interval lower than
  761. generate.min.interval. A value of -1 disables this check.
  762. Setting this value makes the generator consider only URLs that need to be fetched frequently
  763. (i.e. whose CrawlDatum fetchInterval is small); URLs that do not need frequent fetching are not added to the fetchlist.
  764. -1 disables this check.
  765. </description>
  766. </property>
  767.  
  768. <!-- urlpartitioner properties -->
  769.  
  770. <property>
  771. <name>partition.url.mode</name>
  772. <value>byHost</value>
  773. <description>Determines how to partition URLs. Default value is 'byHost',
  774. also takes 'byDomain' or 'byIP'.
  775. This setting controls how, after the map phase, the partition step hashes URLs by host,
  776. so that URLs with the same host end up on the same reduce node.
  777.  
  778. It is the mode used when partitioning the generated fetchlist; the three options are byHost/byDomain/byIP.
  779. </description>
  780. </property>
  781.  
  782. <property>
  783. <name>crawl.gen.delay</name>
  784. <value>604800000</value>
  785. <description>
  786. This value, expressed in milliseconds, defines how long we should keep the lock on records
  787. in CrawlDb that were just selected for fetching. If these records are not updated
  788. in the meantime, the lock is canceled, i.e. they become eligible for selecting.
  789. Default value of this is 7 days (604800000 ms).
  790.  
  791. When the generator runs, it stores the time of the last generate for each URL under the key "_ngt_"
  792. (short for "nutch generate time"), indicating that the URL has already been put on some fetchlist
  793. and may still be in the middle of a fetch->updatedb cycle, which can take a long time or may have failed.
  794. When a later generator run considers whether that URL can be put on the current fetchlist,
  795. it needs a way to decide between adding it now and waiting for the earlier fetch->updatedb cycle to finish
  796. (which would update the URL's _ngt_ in the CrawlDb to the time of the last successful generate).
  797. crawl.gen.delay solves this: if "_ngt_" + crawl.gen.delay is earlier than the current time,
  798. the URL may be added to the fetchlist generated by this run; otherwise it is skipped.
  799. </description>
  800. </property>
  801.  
  802. <!-- fetcher properties -->
  803.  
  804. <property>
  805. <name>fetcher.server.delay</name>
  806. <value>5.0</value>
  807. <description>The number of seconds the fetcher will delay between
  808. successive requests to the same server. Note that this might get
  809. overridden by a Crawl-Delay from a robots.txt and is used ONLY if
  810. fetcher.threads.per.queue is set to 1.
  811. </description>
  812. </property>
  813.  
  814. <property>
  815. <name>fetcher.server.min.delay</name>
  816. <value>0.0</value>
  817. <description>The minimum number of seconds the fetcher will delay between
  818. successive requests to the same server. This value is applicable ONLY
  819. if fetcher.threads.per.queue is greater than 1 (i.e. the host blocking
  820. is turned off).</description>
  821. </property>
  822.  
  823. <property>
  824. <name>fetcher.max.crawl.delay</name>
  825. <value>30</value>
  826. <description>
  827. If the Crawl-Delay in robots.txt is set to greater than this value (in
  828. seconds) then the fetcher will skip this page, generating an error report.
  829. If set to -1 the fetcher will never skip such pages and will wait the
  830. amount of time retrieved from robots.txt Crawl-Delay, however long that
  831. might be.
  832. </description>
  833. </property>
  834.  
  835. <property>
  836. <name>fetcher.threads.fetch</name>
  837. <value>10</value>
  838. <description>The number of FetcherThreads the fetcher should use.
  839. This also determines the maximum number of requests that are
  840. made at once (each FetcherThread handles one connection). The total
  841. number of threads running in distributed mode will be the number of
  842. fetcher threads * number of nodes as fetcher has one map task per node.
  843. This is the maximum number of fetch threads.
  844. </description>
  845. </property>
  846.  
  847. <property>
  848. <name>fetcher.threads.per.queue</name>
  849. <value>1</value>
  850. <description>This number is the maximum number of threads that
  851. should be allowed to access a queue at one time. Setting it to
  852. a value > 1 will cause the Crawl-Delay value from robots.txt to
  853. be ignored and the value of fetcher.server.min.delay to be used
  854. as a delay between successive requests to the same server instead
  855. of fetcher.server.delay.
  856. </description>
  857. </property>
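The two properties above, together with fetcher.server.min.delay, are the usual politeness knobs. A hedged sketch of a slightly more aggressive setup (values are illustrative only): with more than one thread per queue, the robots.txt Crawl-Delay is ignored and fetcher.server.min.delay is used instead of fetcher.server.delay, as described above.

<property>
  <name>fetcher.threads.fetch</name>
  <value>20</value>
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>2</value>
</property>
<property>
  <name>fetcher.server.min.delay</name>
  <value>1.0</value>
</property>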
  858.  
  859. <property>
  860. <name>fetcher.queue.mode</name>
  861. <value>byHost</value>
  862. <description>Determines how to put URLs into queues. Default value is 'byHost',
  863. also takes 'byDomain' or 'byIP'.
  864. </description>
  865. </property>
  866.  
  867. <property>
  868. <name>fetcher.verbose</name>
  869. <value>false</value>
  870. <description>If true, fetcher will log more verbosely.
  871. If true, more detailed information is logged.
  872. </description>
  873. </property>
  874.  
  875. <property>
  876. <name>fetcher.parse</name>
  877. <value>false</value>
  878. <description>If true, fetcher will parse content. Default is false, which means
  879. that a separate parsing step is required after fetching is finished.
  880. Parsing can be done while fetching, but doing so is not recommended.
  881. </description>
  882. </property>
  883.  
  884. <property>
  885. <name>fetcher.store.content</name>
  886. <value>true</value>
  887. <description>If true, fetcher will store content.</description>
  888. </property>
  889.  
  890. <property>
  891. <name>fetcher.timelimit.mins</name>
  892. <value>-1</value>
  893. <description>This is the number of minutes allocated to the fetching.
  894. Once this value is reached, any remaining entry from the input URL list is skipped
  895. and all active queues are emptied. The default value of -1 deactivates the time limit.
  896. </description>
  897. </property>
  898.  
  899. <property>
  900. <name>fetcher.max.exceptions.per.queue</name>
  901. <value>-1</value>
  902. <description>The maximum number of protocol-level exceptions (e.g. timeouts) per
  903. host (or IP) queue. Once this value is reached, any remaining entries from this
  904. queue are purged, effectively stopping the fetching from this host/IP. The default
  905. value of -1 deactivates this limit.
  906. </description>
  907. </property>
  908.  
  909. <property>
  910. <name>fetcher.throughput.threshold.pages</name>
  911. <value>-1</value>
  912. <description>The threshold of minimum pages per second. If the fetcher downloads fewer
  913. pages per second than the configured threshold, the fetcher stops, preventing slow queues
  914. from stalling the throughput. This threshold must be an integer. This can be useful when
  915. fetcher.timelimit.mins is hard to determine. The default value of -1 disables this check.
  916. </description>
  917. </property>
  918.  
  919. <property>
  920. <name>fetcher.throughput.threshold.retries</name>
  921. <value>5</value>
  922. <description>The number of times the fetcher.throughput.threshold is allowed to be exceeded.
  923. This setting prevents accidental slowdowns from immediately killing the fetcher thread.
  924. </description>
  925. </property>
  926.  
  927. <property>
  928. <name>fetcher.throughput.threshold.check.after</name>
  929. <value>5</value>
  930. <description>The number of minutes after which the throughput check is enabled.</description>
  931. </property>
  932.  
  933. <property>
  934. <name>fetcher.threads.timeout.divisor</name>
  935. <value>2</value>
  936. <description>(EXPERT)The thread time-out divisor to use. By default threads have a time-out
  937. value of mapred.task.timeout / 2. Increase this setting if the fetcher waits too
  938. long before killing hung threads. Be careful, too high a setting (+8) will most likely kill the
  939. fetcher threads prematurely.
  940. </description>
  941. </property>
  942.  
  943. <property>
  944. <name>fetcher.queue.depth.multiplier</name>
  945. <value>50</value>
  946. <description>(EXPERT)The fetcher buffers the incoming URLs into queues based on the [host|domain|IP]
  947. (see param fetcher.queue.mode). The depth of the queue is the number of threads times the value of this parameter.
  948. A large value requires more memory but can improve the performance of the fetch when the order of the URLs in the fetch list
  949. is not optimal.
  950. </description>
  951. </property>
  952.  
  953. <property>
  954. <name>fetcher.follow.outlinks.depth</name>
  955. <value>-1</value>
  956. <description>(EXPERT)When fetcher.parse is true and this value is greater than 0 the fetcher will extract outlinks
  957. and follow until the desired depth is reached. A value of 1 means all generated pages are fetched and their first degree
  958. outlinks are fetched and parsed too. Be careful, this feature is in itself agnostic of the state of the CrawlDB and does not
  959. know about already fetched pages. A setting larger than 2 will most likely fetch home pages twice in the same fetch cycle.
  960. It is highly recommended to set db.ignore.external.links to true to restrict the outlink follower to URLs within the same
  961. domain. When disabled (false) the feature is likely to follow duplicates even when depth=1.
  962. A value of -1 or 0 disables this feature.
  963. </description>
  964. </property>
  965.  
  966. <property>
  967. <name>fetcher.follow.outlinks.num.links</name>
  968. <value>4</value>
  969. <description>(EXPERT)The number of outlinks to follow when fetcher.follow.outlinks.depth is enabled. Be careful, this can multiply
  970. the total number of pages to fetch. This works together with fetcher.follow.outlinks.depth.divisor; with the default settings the number of followed outlinks
  971. at depth 1 is 8, not 4.
  972. </description>
  973. </property>
  974.  
  975. <property>
  976. <name>fetcher.follow.outlinks.depth.divisor</name>
  977. <value>2</value>
  978. <description>(EXPERT)The divisor of fetcher.follow.outlinks.num.links per fetcher.follow.outlinks.depth. This decreases the number
  979. of outlinks to follow by increasing depth. The formula used is: outlinks = floor(divisor / depth * num.links). This prevents
  980. exponential growth of the fetch list.
  981. </description>
  982. </property>
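Plugging the defaults above into the formula outlinks = floor(divisor / depth * num.links): with num.links = 4 and divisor = 2, the fetcher follows floor(2 / 1 * 4) = 8 outlinks at depth 1, floor(2 / 2 * 4) = 4 at depth 2, and floor(2 / 4 * 4) = 2 at depth 4, which is the damping the description refers to.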
  983.  
  984. <property>
  985. <name>fetcher.follow.outlinks.ignore.external</name>
  986. <value>true</value>
  987. <description>Whether to ignore or follow external links. Set db.ignore.external.links to false and this to true to store outlinks
  988. in the output but not follow them. If db.ignore.external.links is true this directive is ignored.
  989. </description>
  990. </property>
  991.  
  992. <property>
  993. <name>fetcher.bandwidth.target</name>
  994. <value>-1</value>
  995. <description>Target bandwidth in kilobits per sec for each mapper instance. This is used to adjust the number of
  996. fetching threads automatically (up to fetcher.maxNum.threads). A value of -1 deactivates the functionality, in which case
  997. the number of fetching threads is fixed (see fetcher.threads.fetch).</description>
  998. </property>
  999.  
  1000. <property>
  1001. <name>fetcher.maxNum.threads</name>
  1002. <value>25</value>
  1003. <description>Max number of fetch threads allowed when using fetcher.bandwidth.target. Defaults to fetcher.threads.fetch if unspecified or
  1004. set to a value lower than it. </description>
  1005. </property>
  1006.  
  1007. <property>
  1008. <name>fetcher.bandwidth.target.check.everyNSecs</name>
  1009. <value>30</value>
  1010. <description>(EXPERT) Value in seconds which determines how frequently we should reassess the optimal number of fetch threads when using
  1011. fetcher.bandwidth.target. Defaults to 30 and must be at least 1.</description>
  1012. </property>
  1013.  
  1014. <!-- moreindexingfilter plugin properties -->
  1015.  
  1016. <property>
  1017. <name>moreIndexingFilter.indexMimeTypeParts</name>
  1018. <value>true</value>
  1019. <description>Determines whether the index-more plugin will split the mime-type
  1020. into sub parts; this requires the type field to be multi-valued. Set to true for backward
  1021. compatibility. False will not split the mime-type.
  1022. </description>
  1023. </property>
  1024.  
  1025. <property>
  1026. <name>moreIndexingFilter.mapMimeTypes</name>
  1027. <value>false</value>
  1028. <description>Determines whether MIME-type mapping is enabled. It takes a
  1029. plain text file with mapped MIME-types. With it the user can map both
  1030. application/xhtml+xml and text/html to the same target MIME-type so it
  1031. can be treated equally in an index. See conf/contenttype-mapping.txt.
  1032. </description>
  1033. </property>
  1034.  
  1035. <!-- AnchorIndexing filter plugin properties -->
  1036.  
  1037. <property>
  1038. <name>anchorIndexingFilter.deduplicate</name>
  1039. <value>false</value>
  1040. <description>With this enabled the indexer will case-insensitively deduplicate anchors
  1041. before indexing. This prevents possible hundreds or thousands of identical anchors for
  1042. a given page to be indexed but will affect the search scoring (i.e. tf=1.0f).
  1043. </description>
  1044. </property>
  1045.  
  1046. <!-- indexingfilter plugin properties -->
  1047.  
  1048. <property>
  1049. <name>indexingfilter.order</name>
  1050. <value></value>
  1051. <description>The order by which index filters are applied.
  1052. If empty, all available index filters (as dictated by properties
  1053. plugin-includes and plugin-excludes above) are loaded and applied in system
  1054. defined order. If not empty, only named filters are loaded and applied
  1055. in given order. For example, if this property has value:
  1056. org.apache.nutch.indexer.basic.BasicIndexingFilter org.apache.nutch.indexer.more.MoreIndexingFilter
  1057. then BasicIndexingFilter is applied first, and MoreIndexingFilter second.
  1058.  
  1059. Filter ordering might have impact on result if one filter depends on output of
  1060. another filter.
  1061. </description>
  1062. </property>
  1063.  
  1064. <property>
  1065. <name>indexer.score.power</name>
  1066. <value>0.5</value>
  1067. <description>Determines the power of link analysis scores. Each
  1068. pages's boost is set to <i>score<sup>scorePower</sup></i> where
  1069. <i>score</i> is its link analysis score and <i>scorePower</i> is the
  1070. value of this parameter. This is compiled into indexes, so, when
  1071. this is changed, pages must be re-indexed for it to take
  1072. effect.</description>
  1073. </property>
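As a worked example of the boost formula above: with the default scorePower of 0.5, a page with a link-analysis score of 4.0 receives a boost of 4.0^0.5 = 2.0, and a page with a score of 0.25 receives 0.5, so the spread between high- and low-scoring pages is compressed.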
  1074.  
  1075. <property>
  1076. <name>indexer.max.title.length</name>
  1077. <value>100</value>
  1078. <description>The maximum number of characters of a title that are indexed. A value of -1 disables this check.
  1079. </description>
  1080. </property>
  1081.  
  1082. <property>
  1083. <name>indexer.max.content.length</name>
  1084. <value>-1</value>
  1085. <description>The maximum number of characters of the content that are indexed.
  1086. Content beyond the limit is truncated. A value of -1 disables this check.
  1087. </description>
  1088. </property>
  1089.  
  1090. <property>
  1091. <name>indexer.add.domain</name>
  1092. <value>false</value>
  1093. <description>Whether to add the domain field to a NutchDocument.</description>
  1094. </property>
  1095.  
  1096. <property>
  1097. <name>indexer.skip.notmodified</name>
  1098. <value>false</value>
  1099. <description>Whether the indexer will skip records with a db_notmodified status.
  1100. </description>
  1101. </property>
  1102.  
  1103. <!-- URL normalizer properties -->
  1104.  
  1105. <property>
  1106. <name>urlnormalizer.order</name>
  1107. <value>org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value>
  1108. <description>Order in which normalizers will run. If any of these isn't
  1109. activated it will be silently skipped. If other normalizers not on the
  1110. list are activated, they will run in random order after the ones
  1111. specified here are run.
  1112. </description>
  1113. </property>
  1114.  
  1115. <property>
  1116. <name>urlnormalizer.regex.file</name>
  1117. <value>regex-normalize.xml</value>
  1118. <description>Name of the config file used by the RegexUrlNormalizer class.
  1119. </description>
  1120. </property>
  1121.  
  1122. <property>
  1123. <name>urlnormalizer.loop.count</name>
  1124. <value>1</value>
  1125. <description>Optionally loop through normalizers several times, to make
  1126. sure that all transformations have been performed.
  1127. </description>
  1128. </property>
  1129.  
  1130. <!-- mime properties -->
  1131.  
  1132. <!--
  1133. <property>
  1134. <name>mime.types.file</name>
  1135. <value>tika-mimetypes.xml</value>
  1136. <description>Name of file in CLASSPATH containing filename extension and
  1137. magic sequence to mime types mapping information. Overrides the default Tika config
  1138. if specified.
  1139. </description>
  1140. </property>
  1141. -->
  1142.  
  1143. <property>
  1144. <name>mime.type.magic</name>
  1145. <value>true</value>
  1146. <description>Defines if the mime content type detector uses magic resolution.
  1147. </description>
  1148. </property>
  1149.  
  1150. <!-- plugin properties -->
  1151.  
  1152. <property>
  1153. <name>plugin.folders</name>
  1154. <value>plugins</value>
  1155. <description>Directories where nutch plugins are located. Each
  1156. element may be a relative or absolute path. If absolute, it is used
  1157. as is. If relative, it is searched for on the classpath.
  1158. This property points to the plugin directory. When running inside Eclipse it needs to be changed to ./src/plugin,
  1159. but when running the packaged job on a distributed cluster it must be set back to plugins.
  1160. </description>
  1161. </property>
  1162.  
  1163. <property>
  1164. <name>plugin.auto-activation</name>
  1165. <value>true</value>
  1166. <description>Defines if some plugins that are not activated regarding
  1167. the plugin.includes and plugin.excludes properties must be automatically
  1168. activated if they are needed by some activated plugins.
  1169. </description>
  1170. </property>
  1171.  
  1172. <property>
  1173. <name>plugin.includes</name>
  1174. <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  1175. <description>Regular expression naming plugin directory names to
  1176. include. Any plugin not matching this expression is excluded.
  1177. In any case you need at least include the nutch-extensionpoints plugin. By
  1178. default Nutch includes crawling just HTML and plain text via HTTP,
  1179. and basic indexing and search plugins. In order to use HTTPS please enable
  1180. protocol-httpclient, but be aware of possible intermittent problems with the
  1181. underlying commons-httpclient library.
  1182. This is the main plugin configuration item: plugin.includes lists the plugins that will be loaded.
  1183. </description>
  1184. </property>
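Following the note about HTTPS in the description above, one way to enable it is to swap protocol-http for protocol-httpclient when overriding this property in nutch-site.xml; the rest of the list below simply repeats the default value:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>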
  1185.  
  1186. <property>
  1187. <name>plugin.excludes</name>
  1188. <value></value>
  1189. <description>Regular expression naming plugin directory names to exclude.
  1190. </description>
  1191. </property>
  1192.  
  1193. <property>
  1194. <name>urlmeta.tags</name>
  1195. <value></value>
  1196. <description>
  1197. To be used in conjunction with features introduced in NUTCH-655, which allows
  1198. for custom metatags to be injected alongside your crawl URLs. Specifying those
  1199. custom tags here will allow for their propagation into a page's outlinks, as
  1200. well as allow for them to be included as part of an index.
  1201. Values should be comma-delimited. ("tag1,tag2,tag3") Do not pad the tags with
  1202. white-space at their boundaries, if you are using anything earlier than Hadoop-0.21.
  1203. </description>
  1204. </property>
  1205.  
  1206. <!-- parser properties -->
  1207.  
  1208. <property>
  1209. <name>parse.plugin.file</name>
  1210. <value>parse-plugins.xml</value>
  1211. <description>The name of the file that defines the associations between
  1212. content-types and parsers.</description>
  1213. </property>
  1214.  
  1215. <property>
  1216. <name>parser.character.encoding.default</name>
  1217. <value>windows-1252</value>
  1218. <description>The character encoding to fall back to when no other information
  1219. is available.
  1220. windows-1252 is the default encoding used when parsing documents; it is applied whenever no encoding can be detected from the document itself.
  1221. </description>
  1222. </property>
  1223.  
  1224. <property>
  1225. <name>encodingdetector.charset.min.confidence</name>
  1226. <value>-1</value>
  1227. <description>An integer between 0 and 100 indicating the minimum confidence value
  1228. for charset auto-detection. Any negative value disables auto-detection.
  1229. </description>
  1230. </property>
  1231.  
  1232. <property>
  1233. <name>parser.caching.forbidden.policy</name>
  1234. <value>content</value>
  1235. <description>If a site (or a page) requests through its robot metatags
  1236. that it should not be shown as cached content, apply this policy. Currently
  1237. three keywords are recognized: "none" ignores any "noarchive" directives.
  1238. "content" doesn't show the content, but shows summaries (snippets).
  1239. "all" doesn't show either content or summaries.</description>
  1240. </property>
  1241.  
  1242. <property>
  1243. <name>parser.html.impl</name>
  1244. <value>neko</value>
  1245. <description>HTML Parser implementation. Currently the following keywords
  1246. are recognized: "neko" uses NekoHTML, "tagsoup" uses TagSoup.
  1247. This selects the parser implementation used for HTML documents; NekoHTML is the more capable one.
  1248. (A separate article will later cover Neko's HTML-to-text conversion and HTML-fragment parsing.)
  1249. </description>
  1250. </property>
  1251.  
  1252. <property>
  1253. <name>parser.html.form.use_action</name>
  1254. <value>false</value>
  1255. <description>If true, HTML parser will collect URLs from form action
  1256. attributes. This may lead to undesirable behavior (submitting empty
  1257. forms during next fetch cycle). If false, form action attribute will
  1258. be ignored.</description>
  1259. </property>
  1260.  
  1261. <property>
  1262. <name>parser.html.outlinks.ignore_tags</name>
  1263. <value></value>
  1264. <description>Comma separated list of HTML tags, from which outlinks
  1265. shouldn't be extracted. Nutch takes links from: a, area, form, frame,
  1266. iframe, script, link, img. If you add any of those tags here, it
  1267. won't be taken. Default is empty list. Probably reasonable value
  1268. for most people would be "img,script,link".</description>
  1269. </property>
  1270.  
  1271. <property>
  1272. <name>htmlparsefilter.order</name>
  1273. <value></value>
  1274. <description>The order by which HTMLParse filters are applied.
  1275. If empty, all available HTMLParse filters (as dictated by properties
  1276. plugin-includes and plugin-excludes above) are loaded and applied in system
  1277. defined order. If not empty, only named filters are loaded and applied
  1278. in given order.
  1279. HTMLParse filter ordering MAY have an impact
  1280. on end result, as some filters could rely on the metadata generated by a previous filter.
  1281. </description>
  1282. </property>
  1283.  
  1284. <property>
  1285. <name>parser.timeout</name>
  1286. <value>30</value>
  1287. <description>Timeout in seconds for the parsing of a document; when exceeded, it is treated as an exception and
  1288. the parser moves on to the following documents. This parameter is applied to any Parser implementation.
  1289. Set to -1 to deactivate, bearing in mind that this could cause
  1290. the parsing to crash because of a very long or corrupted document.
  1291. </description>
  1292. </property>
  1293.  
  1294. <property>
  1295. <name>parse.filter.urls</name>
  1296. <value>true</value>
  1297. <description>Whether the parser will filter URLs (with the configured URL filters).</description>
  1298. </property>
  1299.  
  1300. <property>
  1301. <name>parse.normalize.urls</name>
  1302. <value>true</value>
  1303. <description>Whether the parser will normalize URLs (with the configured URL normalizers).</description>
  1304. </property>
  1305.  
  1306. <property>
  1307. <name>parser.skip.truncated</name>
  1308. <value>true</value>
  1309. <description>Boolean value for whether we should skip parsing for truncated documents. By default this
  1310. property is activated because of the extremely high CPU load which parsing can sometimes incur.
  1311. </description>
  1312. </property>
  1313.  
  1314. <!--
  1315. <property>
  1316. <name>tika.htmlmapper.classname</name>
  1317. <value>org.apache.tika.parser.html.IdentityHtmlMapper</value>
  1318. <description>Classname of Tika HTMLMapper to use. Influences the elements included in the DOM and hence
  1319. the behaviour of the HTMLParseFilters.
  1320. </description>
  1321. </property>
  1322. -->
  1323.  
  1324. <property>
  1325. <name>tika.uppercase.element.names</name>
  1326. <value>true</value>
  1327. <description>Determines whether TikaParser should uppercase the element name while generating the DOM
  1328. for a page, as done by Neko (used per default by parse-html)(see NUTCH-1592).
  1329. </description>
  1330. </property>
  1331.  
  1332. <!-- urlfilter plugin properties -->
  1333.  
  1334. <property>
  1335. <name>urlfilter.domain.file</name>
  1336. <value>domain-urlfilter.txt</value>
  1337. <description>Name of file on CLASSPATH containing either top level domains or
  1338. hostnames used by urlfilter-domain (DomainURLFilter) plugin.</description>
  1339. </property>
  1340.  
  1341. <property>
  1342. <name>urlfilter.regex.file</name>
  1343. <value>regex-urlfilter.txt</value>
  1344. <description>Name of file on CLASSPATH containing regular expressions
  1345. used by urlfilter-regex (RegexURLFilter) plugin.</description>
  1346. </property>
  1347.  
  1348. <property>
  1349. <name>urlfilter.automaton.file</name>
  1350. <value>automaton-urlfilter.txt</value>
  1351. <description>Name of file on CLASSPATH containing regular expressions
  1352. used by urlfilter-automaton (AutomatonURLFilter) plugin.</description>
  1353. </property>
  1354.  
  1355. <property>
  1356. <name>urlfilter.prefix.file</name>
  1357. <value>prefix-urlfilter.txt</value>
  1358. <description>Name of file on CLASSPATH containing url prefixes
  1359. used by urlfilter-prefix (PrefixURLFilter) plugin.</description>
  1360. </property>
  1361.  
  1362. <property>
  1363. <name>urlfilter.suffix.file</name>
  1364. <value>suffix-urlfilter.txt</value>
  1365. <description>Name of file on CLASSPATH containing url suffixes
  1366. used by urlfilter-suffix (SuffixURLFilter) plugin.</description>
  1367. </property>
  1368.  
  1369. <property>
  1370. <name>urlfilter.order</name>
  1371. <value></value>
  1372. <description>The order by which url filters are applied.
  1373. If empty, all available url filters (as dictated by properties
  1374. plugin-includes and plugin-excludes above) are loaded and applied in system
  1375. defined order. If not empty, only named filters are loaded and applied
  1376. in given order. For example, if this property has value:
  1377. org.apache.nutch.urlfilter.regex.RegexURLFilter org.apache.nutch.urlfilter.prefix.PrefixURLFilter
  1378. then RegexURLFilter is applied first, and PrefixURLFilter second.
  1379. Since all filters are AND'ed, filter ordering does not affect the
  1380. end result, but it may have performance implications, depending
  1381. on the relative cost of the filters.
  1382. </description>
  1383. </property>
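<!-- Editor's note: an illustrative nutch-site.xml override (not part of the stock file), using
     the two filter class names from the description above, so that only RegexURLFilter and
     PrefixURLFilter are loaded, in that order:
<property>
  <name>urlfilter.order</name>
  <value>org.apache.nutch.urlfilter.regex.RegexURLFilter org.apache.nutch.urlfilter.prefix.PrefixURLFilter</value>
</property>
-->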
  1384.  
  1385. <!-- scoring filters properties -->
  1386.  
  1387. <property>
  1388. <name>scoring.filter.order</name>
  1389. <value></value>
  1390. <description>The order in which scoring filters are applied. This
  1391. may be left empty (in which case all available scoring filters will
  1392. be applied in system defined order), or a space separated list of
  1393. implementation classes.
  1394. </description>
  1395. </property>
  1396.  
  1397. <!-- scoring-depth properties
  1398. Add 'scoring-depth' to the list of active plugins
  1399. in the parameter 'plugin.includes' in order to use it.
  1400. -->
  1401.  
  1402. <property>
  1403. <name>scoring.depth.max</name>
  1404. <value>1000</value>
  1405. <description>Max depth value from seed allowed by default.
  1406. Can be overridden on a per-seed basis by specifying "_maxdepth_=VALUE"
  1407. as a seed metadata. This plugin adds a "_depth_" metadatum to the pages
  1408. to track the distance from the seed it was found from.
  1409. The depth is used to prioritise URLs in the generation step so that
  1410. shallower pages are fetched first.
  1411. </description>
  1412. </property>
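<!-- Editor's note: a hypothetical seed-list line showing the per-seed override mentioned above.
     In Nutch's injector, metadata can typically be appended to a seed URL as key=value pairs
     (separated from the URL by a tab); here the example seed is limited to depth 3:
http://www.example.com/    _maxdepth_=3
-->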
  1413.  
  1414. <!-- language-identifier plugin properties -->
  1415.  
  1416. <property>
  1417. <name>lang.analyze.max.length</name>
  1418. <value>2048</value>
  1419. <description>The maximum number of bytes of data used to identify
  1420. the language (0 means full content analysis).
  1421. The larger this value, the better the analysis, but the
  1422. slower it is.
  1423. This is language-related; it is used during word segmentation (tokenization).
  1424. </description>
  1425. </property>
  1426.  
  1427. <property>
  1428. <name>lang.extraction.policy</name>
  1429. <value>detect,identify</value>
  1430. <description>This determines when the plugin uses detection and
  1431. statistical identification mechanisms. The order in which the
  1432. detect and identify are written will determine the extraction
  1433. policy. Default case (detect,identify) means the plugin will
  1434. first try to extract language info from page headers and metadata,
  1435. if this is not successful it will try using tika language
  1436. identification. Possible values are:
  1437. detect
  1438. identify
  1439. detect,identify
  1440. identify,detect
  1441. </description>
  1442. </property>
  1443.  
  1444. <property>
  1445. <name>lang.identification.only.certain</name>
  1446. <value>false</value>
  1447. <description>If set to true with lang.extraction.policy containing identify,
  1448. the language code returned by Tika will be assigned to the document ONLY
  1449. if it is deemed certain by Tika.
  1450. </description>
  1451. </property>
  1452.  
  1453. <!-- index-static plugin properties -->
  1454.  
  1455. <property>
  1456. <name>index.static</name>
  1457. <value></value>
  1458. <description>
  1459. Used by plugin index-static to add fields with static data at indexing time.
  1460. You can specify a comma-separated list of fieldname:fieldcontent per Nutch job.
  1461. Each fieldcontent can have multiple values separated by space, e.g.,
  1462. field1:value1.1 value1.2 value1.3,field2:value2.1 value2.2 ...
  1463. It can be useful when collections can't be created by URL patterns
  1464. (as with the subcollection plugin), but only on a per-job basis.
  1465. </description>
  1466. </property>
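<!-- Editor's note: a hypothetical nutch-site.xml value following the format described above;
     the field names "source" and "tag" are only examples. This adds a single-valued "source"
     field and a multi-valued "tag" field to every indexed document:
<property>
  <name>index.static</name>
  <value>source:mycrawl,tag:news sports</value>
</property>
-->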
  1467.  
  1468. <!-- index-metadata plugin properties -->
  1469.  
  1470. <property>
  1471. <name>index.parse.md</name>
  1472. <value>metatag.description,metatag.keywords</value>
  1473. <description>
  1474. Comma-separated list of keys to be taken from the parse metadata to generate fields.
  1475. Can be used e.g. for 'description' or 'keywords' provided that these values are generated
  1476. by a parser (see parse-metatags plugin)
  1477. </description>
  1478. </property>
  1479.  
  1480. <property>
  1481. <name>index.content.md</name>
  1482. <value></value>
  1483. <description>
  1484. Comma-separated list of keys to be taken from the content metadata to generate fields.
  1485. </description>
  1486. </property>
  1487.  
  1488. <property>
  1489. <name>index.db.md</name>
  1490. <value></value>
  1491. <description>
  1492. Comma-separated list of keys to be taken from the crawldb metadata to generate fields.
  1493. Can be used to index values propagated from the seeds with the plugin urlmeta
  1494. </description>
  1495. </property>
  1496.  
  1497. <!-- index-geoip plugin properties -->
  1498. <property>
  1499. <name>index.geoip.usage</name>
  1500. <value>insightsService</value>
  1501. <description>
  1502. A string representing the information source to be used for GeoIP information
  1503. association. Either enter 'cityDatabase', 'connectionTypeDatabase',
  1504. 'domainDatabase', 'ispDatabase' or 'insightsService'. If you wish to use any one of the
  1505. Database options, you should make one of GeoIP2-City.mmdb, GeoIP2-Connection-Type.mmdb,
  1506. GeoIP2-Domain.mmdb or GeoIP2-ISP.mmdb files respectively available on the classpath and
  1507. available at runtime.
  1508. </description>
  1509. </property>
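<!-- Editor's note: for example, to use the GeoIP2 city database instead of the Insights web
     service, a nutch-site.xml override could pick one of the values listed above and place
     GeoIP2-City.mmdb on the classpath:
<property>
  <name>index.geoip.usage</name>
  <value>cityDatabase</value>
</property>
-->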
  1510.  
  1511. <property>
  1512. <name>index.geoip.userid</name>
  1513. <value></value>
  1514. <description>
  1515. The userId associated with the GeoIP2 Precision Services account.
  1516. </description>
  1517. </property>
  1518.  
  1519. <property>
  1520. <name>index.geoip.licensekey</name>
  1521. <value></value>
  1522. <description>
  1523. The license key associated with the GeoIP2 Precision Services account.
  1524. </description>
  1525. </property>
  1526.  
  1527. <!-- parse-metatags plugin properties -->
  1528. <property>
  1529. <name>metatags.names</name>
  1530. <value>description,keywords</value>
  1531. <description> Names of the metatags to extract, separated by ','.
  1532. Use '*' to extract all metatags. The names are prefixed with 'metatag.'
  1533. in the parse metadata. For instance, to index description and keywords,
  1534. you need to activate the plugin index-metadata and set the value of the
  1535. parameter 'index.parse.md' to 'metatag.description,metatag.keywords'.
  1536. </description>
  1537. </property>
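<!-- Editor's note: an illustrative nutch-site.xml pairing, as suggested by the description
     above; the parse-metatags and index-metadata plugins must also be present in
     plugin.includes (not shown here):
<property>
  <name>metatags.names</name>
  <value>description,keywords</value>
</property>
<property>
  <name>index.parse.md</name>
  <value>metatag.description,metatag.keywords</value>
</property>
-->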
  1538.  
  1539. <!-- Temporary Hadoop 0.17.x workaround. -->
  1540.  
  1541. <property>
  1542. <name>hadoop.job.history.user.location</name>
  1543. <value>${hadoop.log.dir}/history/user</value>
  1544. <description>Hadoop 0.17.x comes with a default setting to create
  1545. user logs inside the output path of the job. This breaks some
  1546. Hadoop classes, which expect the output to contain only
  1547. part-XXXXX files. This setting changes the output to a
  1548. subdirectory of the regular log directory.
  1549. </description>
  1550. </property>
  1551.  
  1552. <!-- linkrank scoring properties -->
  1553.  
  1554. <property>
  1555. <name>link.ignore.internal.host</name>
  1556. <value>true</value>
  1557. <description>Ignore outlinks to the same hostname.</description>
  1558. </property>
  1559.  
  1560. <property>
  1561. <name>link.ignore.internal.domain</name>
  1562. <value>true</value>
  1563. <description>Ignore outlinks to the same domain.</description>
  1564. </property>
  1565.  
  1566. <property>
  1567. <name>link.ignore.limit.page</name>
  1568. <value>true</value>
  1569. <description>Limit to only a single outlink to the same page.</description>
  1570. </property>
  1571.  
  1572. <property>
  1573. <name>link.ignore.limit.domain</name>
  1574. <value>true</value>
  1575. <description>Limit to only a single outlink to the same domain.</description>
  1576. </property>
  1577.  
  1578. <property>
  1579. <name>link.analyze.num.iterations</name>
  1580. <value>10</value>
  1581. <description>The number of LinkRank iterations to run.</description>
  1582. </property>
  1583.  
  1584. <property>
  1585. <name>link.analyze.initial.score</name>
  1586. <value>1.0f</value>
  1587. <description>The initial score.</description>
  1588. </property>
  1589.  
  1590. <property>
  1591. <name>link.analyze.damping.factor</name>
  1592. <value>0.85f</value>
  1593. <description>The damping factor.</description>
  1594. </property>
  1595.  
  1596. <property>
  1597. <name>link.delete.gone</name>
  1598. <value>false</value>
  1599. <description>Whether to delete gone pages from the web graph.</description>
  1600. </property>
  1601.  
  1602. <property>
  1603. <name>link.loops.depth</name>
  1604. <value>2</value>
  1605. <description>The depth for the loops algorithm.</description>
  1606. </property>
  1607.  
  1608. <property>
  1609. <name>link.score.updater.clear.score</name>
  1610. <value>0.0f</value>
  1611. <description>The default score for URLs that are not in the web graph.</description>
  1612. </property>
  1613.  
  1614. <property>
  1615. <name>mapreduce.fileoutputcommitter.marksuccessfuljobs</name>
  1616. <value>false</value>
  1617. <description>Hadoop >= 0.21 generates SUCCESS files in the output which can crash
  1618. the readers. This should not be an issue once Nutch is ported to the new MapReduce API
  1619. but for now this parameter should prevent such cases.
  1620. </description>
  1621. </property>
  1622.  
  1623. <!-- solr index properties -->
  1624. <property>
  1625. <name>solr.server.url</name>
  1626. <value>http://127.0.0.1:8983/solr/</value>
  1627. <description>
  1628. Defines the Solr URL into which data should be indexed using the
  1629. indexer-solr plugin.
  1630. </description>
  1631. </property>
  1632.  
  1633. <property>
  1634. <name>solr.mapping.file</name>
  1635. <value>solrindex-mapping.xml</value>
  1636. <description>
  1637. Defines the name of the file that will be used in the mapping of internal
  1638. nutch field names to solr index fields as specified in the target Solr schema.
  1639. </description>
  1640. </property>
  1641.  
  1642. <property>
  1643. <name>solr.commit.size</name>
  1644. <value>250</value>
  1645. <description>
  1646. Defines the number of documents to send to Solr in a single update batch.
  1647. Decrease when handling very large documents to prevent Nutch from running
  1648. out of memory. NOTE: It does not explicitly trigger a server side commit.
  1649. </description>
  1650. </property>
  1651.  
  1652. <property>
  1653. <name>solr.commit.index</name>
  1654. <value>true</value>
  1655. <description>
  1656. When closing the indexer, trigger a commit to the Solr server.
  1657. </description>
  1658. </property>
  1659.  
  1660. <property>
  1661. <name>solr.auth</name>
  1662. <value>false</value>
  1663. <description>
  1664. Whether to enable HTTP basic authentication for communicating with Solr.
  1665. Use the solr.auth.username and solr.auth.password properties to configure
  1666. your credentials.
  1667. </description>
  1668. </property>
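<!-- Editor's note: a minimal sketch of the Solr overrides in nutch-site.xml, assuming a local
     Solr core named "nutch"; the URL, username and password below are placeholders, not defaults:
<property>
  <name>solr.server.url</name>
  <value>http://127.0.0.1:8983/solr/nutch</value>
</property>
<property>
  <name>solr.auth</name>
  <value>true</value>
</property>
<property>
  <name>solr.auth.username</name>
  <value>solradmin</value>
</property>
<property>
  <name>solr.auth.password</name>
  <value>changeme</value>
</property>
-->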
  1669.  
  1670. <!-- Elasticsearch properties -->
  1671.  
  1672. <property>
  1673. <name>elastic.host</name>
  1674. <value></value>
  1675. <description>The hostname to send documents to using TransportClient. Either host
  1676. and port must be defined or cluster.</description>
  1677. </property>
  1678.  
  1679. <property>
  1680. <name>elastic.port</name>
  1681. <value>9300</value>
  1682. <description>The port to connect to using TransportClient.</description>
  1683. </property>
  1684.  
  1685. <property>
  1686. <name>elastic.cluster</name>
  1687. <value></value>
  1688. <description>The cluster name to discover. Either host and port must be defined
  1689. or cluster.</description>
  1690. </property>
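<!-- Editor's note: as the descriptions above state, either elastic.host plus elastic.port or
     elastic.cluster must be set; a hypothetical nutch-site.xml using host and port could be:
<property>
  <name>elastic.host</name>
  <value>localhost</value>
</property>
<property>
  <name>elastic.port</name>
  <value>9300</value>
</property>
-->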
  1691.  
  1692. <property>
  1693. <name>elastic.index</name>
  1694. <value>nutch</value>
  1695. <description>Default index to send documents to.</description>
  1696. </property>
  1697.  
  1698. <property>
  1699. <name>elastic.max.bulk.docs</name>
  1700. <value>250</value>
  1701. <description>Maximum size of the bulk in number of documents.</description>
  1702. </property>
  1703.  
  1704. <property>
  1705. <name>elastic.max.bulk.size</name>
  1706. <value>2500500</value>
  1707. <description>Maximum size of the bulk in bytes.</description>
  1708. </property>
  1709.  
  1710. <!-- subcollection properties -->
  1711.  
  1712. <property>
  1713. <name>subcollection.default.fieldname</name>
  1714. <value>subcollection</value>
  1715. <description>
  1716. The default field name for the subcollections.
  1717. </description>
  1718. </property>
  1719.  
  1720. <!-- Headings plugin properties -->
  1721.  
  1722. <property>
  1723. <name>headings</name>
  1724. <value>h1,h2</value>
  1725. <description>Comma separated list of headings to retrieve from the document</description>
  1726. </property>
  1727.  
  1728. <property>
  1729. <name>headings.multivalued</name>
  1730. <value>false</value>
  1731. <description>Whether to support multivalued headings.</description>
  1732. </property>
  1733.  
  1734. <!-- mimetype-filter plugin properties -->
  1735.  
  1736. <property>
  1737. <name>mimetype.filter.file</name>
  1738. <value>mimetype-filter.txt</value>
  1739. <description>
  1740. The configuration file for the mimetype-filter plugin. This file contains
  1741. the rules used to allow or deny the indexing of certain documents.
  1742. </description>
  1743. </property>
  1744.  
  1745. </configuration>

regex-urlfilter解释.txt

 # Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# '-' means the URL is excluded; '+' means it is included.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
# The rule below would filter out every URL containing the characters ? * ! @ =,
# so such URLs could never be crawled; it is recommended to relax it to -[~],
# which is what is done here.
#-[?*!@=]
-[~]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
#+.

# Filter regular expression: ([a-z0-9]*\.)* matches any dot-separated run of letters and digits,
# and [\s\S]* matches any characters.
+^http://([a-z0-9]*\.)*bbs.superwu.cn/[\s\S]*

# Crawl the pages of the Discuz forum.
+^http://bbs\.superwu\.cn/forum\.php$
+^http://bbs\.superwu\.cn/forum\.php\?mod=forumdisplay&fid=\d+$
+^http://bbs\.superwu\.cn/forum\.php\?mod=forumdisplay&fid=\d+&page=\d+$
+^http://bbs\.superwu\.cn/forum\.php\?mod=viewthread&tid=\d+&extra=page%3D\d+$
+^http://bbs\.superwu\.cn/forum\.php\?mod=viewthread&tid=\d+&extra=page%3D\d+&page=\d+$
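
To illustrate the first-match semantics with the rules above (the URLs are only examples): http://bbs.superwu.cn/forum.php?mod=forumdisplay&fid=5 is accepted, because the first pattern it matches is the broad +^http://([a-z0-9]*\.)*bbs.superwu.cn/[\s\S]* rule; http://bbs.superwu.cn/logo.png is rejected, because the '-' image-suffix rule appears earlier in the file; and http://www.example.com/ is ignored, because with the catch-all '+.' commented out it matches no '+' pattern at all.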
