[root@ewanalysis ~]# nutch
Usage: nutch COMMAND
where COMMAND is one of:
  inject         inject new urls into the database
  hostinject     creates or updates an existing host table from a text file
  generate       generate new batches to fetch from crawl db
  fetch          fetch URLs marked during generate
  parse          parse URLs marked during fetch
  updatedb       update web table after parsing
  updatehostdb   update host table after parsing
  readdb         read/dump records from page database
  readhostdb     display entries from the hostDB
  index          run the plugin-based indexer on parsed batches
  elasticindex   run the elasticsearch indexer - DEPRECATED use the index command instead
  solrindex      run the solr indexer on parsed batches - DEPRECATED use the index command instead
  solrdedup      remove duplicates from solr
  solrclean      remove HTTP 301 and 404 documents from solr - DEPRECATED use the clean command instead
  clean          remove HTTP 301 and 404 documents and duplicates from indexing backends configured via plugins
  parsechecker   check the parser for a given url
  indexchecker   check the indexing filters for a given url
  plugin         load a plugin and run one of its classes main()
  nutchserver    run a (local) Nutch server on a user defined port
  webapp         run a local Nutch web application
  junit          runs the given JUnit test
  CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
Usage: crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>
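The optional <solrUrl> sits between <crawlID> and <numberOfRounds>, which is easy to get wrong when scripting. The sketch below assembles the argument list in the order given by the usage line. The function name build_crawl_cmd and the CRAWL variable are hypothetical helpers, not part of Nutch; the function only prints the command so it can be checked before running.

```shell
#!/bin/sh
# Hypothetical helper: prints a `crawl` command line matching
#   crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>
# CRAWL is an assumed variable naming the crawl script.
CRAWL="${CRAWL:-crawl}"

# build_crawl_cmd <seedDir> <crawlID> <numberOfRounds> [solrUrl]
build_crawl_cmd() {
  if [ "$#" -lt 3 ]; then
    echo "usage: build_crawl_cmd <seedDir> <crawlID> <numberOfRounds> [solrUrl]" >&2
    return 1
  fi
  seed_dir="$1"; crawl_id="$2"; rounds="$3"; solr_url="${4:-}"
  # Reject a non-numeric round count early
  case "$rounds" in
    ''|*[!0-9]*) echo "numberOfRounds must be a non-negative integer" >&2; return 1 ;;
  esac
  # solrUrl, when given, goes before numberOfRounds
  if [ -n "$solr_url" ]; then
    printf '%s\n' "$CRAWL $seed_dir $crawl_id $solr_url $rounds"
  else
    printf '%s\n' "$CRAWL $seed_dir $crawl_id $rounds"
  fi
}
```

For example, build_crawl_cmd urls/ testcrawl 2 prints a three-argument crawl command; the seed directory and crawl ID here are placeholder values.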
nutch inject
Usage: InjectorJob <url_dir> [-crawlId <id>]
nutch generate
Usage: GeneratorJob [-topN N] [-crawlId id] [-noFilter] [-noNorm] [-adddays numDays]
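[-topN N]: number of top URLs to select; the default is Long.MAX_VALUE
[-noNorm]: do not activate the normalizer plugin to normalize URLs; the default is true
[-adddays numDays]: adds <numDays> to the current time so that URLs already fetched can be crawled again sooner than db.fetch.interval.default would allow; the default is 0. URLs whose fetch time falls before this adjusted time are selected.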
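As a rough sketch of how these options combine, the function below assembles a generate invocation from the flags in the usage line. The name build_generate_cmd and the NUTCH variable are hypothetical, introduced only for illustration; the flags themselves are the ones listed above. It prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: prints a `nutch generate` command line.
# NUTCH is an assumed variable pointing at the nutch launcher script.
NUTCH="${NUTCH:-nutch}"

# build_generate_cmd <crawlId> [topN] [addDays]
build_generate_cmd() {
  crawl_id="$1"
  top_n="${2:-}"
  add_days="${3:-}"
  cmd="$NUTCH generate -crawlId $crawl_id"
  # -topN is optional; when omitted, the default noted above is Long.MAX_VALUE
  if [ -n "$top_n" ]; then cmd="$cmd -topN $top_n"; fi
  # -adddays is optional; the default noted above is 0
  if [ -n "$add_days" ]; then cmd="$cmd -adddays $add_days"; fi
  printf '%s\n' "$cmd"
}
```

Passing a value for addDays shifts the reference time forward, so pages that are not yet due can be regenerated; the crawl ID and numbers passed in are up to the caller.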
nutch fetch
Usage: FetcherJob (<batchId> | -all) [-crawlId <id>] [-threads N] [-resume] [-numTasks N]
[-crawlId <id>]:
[-threads N]: number of fetcher threads to run; the default comes from the configuration key fetcher.threads.fetch, which is 10
[-numTasks N]: if N > 0, use N reduce tasks for fetching (default:
nutch parse
Usage: ParserJob (<batchId> | -all) [-crawlId <id>] [-resume] [-force]
[-crawlId <id>]:
nutch updatedb
Usage: DbUpdaterJob (<batchId> | -all) [-crawlId <id>]
<batchId> - crawl identifier returned by Generator, or -all for all generated batchId-s
-crawlId <id> - the id to prefix the schemas to operate on,
nutch index
Usage: IndexingJob (<batchId> | -all | -reindex) [-crawlId <id>]
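
The commands above are normally run in sequence: inject once, then generate, fetch, parse, updatedb and index for each round. The following is a minimal sketch of that sequence using only the flags shown in the usage lines, with -all in place of a specific batch ID. The function names inject_seeds and run_round, the NUTCH variable, and the default topN of 1000 are assumptions made for illustration, not Nutch conventions.

```shell
#!/bin/sh
# Sketch of one crawl round using only commands and flags shown above.
# NUTCH is an assumed variable; set NUTCH="echo nutch" for a dry run.
NUTCH="${NUTCH:-nutch}"

# inject_seeds <url_dir> <crawlId>
inject_seeds() {
  $NUTCH inject "$1" -crawlId "$2"
}

# run_round <crawlId> [topN] [threads]
run_round() {
  crawl_id="$1"
  top_n="${2:-1000}"   # illustrative value, not a Nutch default
  threads="${3:-10}"   # same as the fetcher.threads.fetch default noted above
  # Stop the round as soon as any step fails
  $NUTCH generate -topN "$top_n" -crawlId "$crawl_id" || return 1
  $NUTCH fetch -all -crawlId "$crawl_id" -threads "$threads" || return 1
  $NUTCH parse -all -crawlId "$crawl_id" || return 1
  $NUTCH updatedb -all -crawlId "$crawl_id" || return 1
  $NUTCH index -all -crawlId "$crawl_id" || return 1
}
```

With NUTCH="echo nutch", calling run_round demo 500 4 prints the five commands without touching the store, which is a cheap way to check the arguments before a real run.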
