How to modify regex-urlfilter.txt in Nutch to crawl links that match your criteria

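For example, while crawling the 学生在线 (Student Online) site, I found that certain notices could not be crawled, such as 《中粮福临门助学基金申请公告》 (the COFCO Fulinmen student aid fund application announcement). Analysis showed that the links to these notices were being filtered out. Below I go through regex-urlfilter.txt, the configuration file that filters URLs, so that you can adapt it to your own situation when needed: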

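Note: in the configuration file, lines starting with "#" are comments. A line starting with "-" means that URLs matching the regular expression are filtered out, and a line starting with "+" means that URLs matching the regular expression are kept. In the regular expressions, "^" marks the start of the string, "$" marks the end of the string, and "[]" denotes a character set. Comments not found in the stock file are annotations I added.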

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# The default url filter.
# Better for whole-internet crawling.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file: ftp: and mailto: urls
# Filter out links that do not use the HTTP protocol, such as file: and ftp:
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
# Filter out links to images and similar file formats
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
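# -[?*!@=] was the original rule, which filters out links containing special characters. To crawl more links, the rule is changed so that links containing ? and = are no longer filtered out.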
-[*!@]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
# Filter out links with certain special formats (a path segment that keeps repeating)
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
# Accept all remaining links. You can change this so that only the link types you specify are accepted.
+.
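
The file's header comments say that the first matching pattern decides and that a URL matching no pattern is ignored. The following is a minimal Python sketch of that evaluation order, not Nutch's actual implementation. It assumes each pattern is tested with a substring search (like `re.search`) rather than a full match. `parse_rules` and `is_accepted` are hypothetical helper names, and the host-restricted `+` rule is an assumed example of accepting only the link types you specify, standing in for a catch-all accept rule.

```python
import re

# Illustrative rule set in regex-urlfilter.txt syntax. The last rule is a
# hypothetical accept rule that keeps only one host (an assumption for
# this sketch, not a line from the article's file).
RULES_TEXT = r"""
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip URLs containing certain characters
-[*!@]
# accept only this site (assumed example)
+^https?://www\.online\.sdu\.edu\.cn/
"""


def parse_rules(text):
    """Return (accept, compiled_regex) pairs, skipping blanks and comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def is_accepted(rules, url):
    """First matching rule decides; a URL matching no rule is rejected."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False


rules = parse_rules(RULES_TEXT)
print(is_accepted(rules, "http://www.online.sdu.edu.cn/news/article.php?pid=636514943"))  # True
print(is_accepted(rules, "ftp://www.online.sdu.edu.cn/file.txt"))  # False
print(is_accepted(rules, "http://example.com/"))  # False
```

Because the first match wins, a narrow `+` rule only takes effect for URLs that none of the `-` rules above it have already rejected.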

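Explanation: the announcement link being crawled is http://www.online.sdu.edu.cn/news/article.php?pid=636514943. It contains the characters ? and =, so it was rejected by the rule that filters special characters. After modifying regex-urlfilter.txt as shown above, links to announcements of this kind can be crawled.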
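
As a quick check of this explanation, the two character-class patterns can be tried against the announcement URL in Python. This is only a sketch: it assumes the filter applies each pattern as a substring search, and the variable names are illustrative.

```python
import re

url = "http://www.online.sdu.edu.cn/news/article.php?pid=636514943"

original_rule = re.compile(r"[?*!@=]")  # pattern of the default "-" rule
modified_rule = re.compile(r"[*!@]")    # pattern of the edited "-" rule

# Characters in the URL that trigger the original exclusion rule.
print(original_rule.findall(url))         # ['?', '=']

# The edited rule finds nothing, so the URL is no longer excluded by it.
print(modified_rule.search(url) is None)  # True
```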
