A simple web crawler demo written in Java
Features:
Download attachments from a website and extract the article content from its pages.
About NIO
In most cases, Java applications are not really I/O bound. It is not that the operating system cannot deliver data fast enough to keep Java busy; rather, the JVM itself is inefficient at I/O. There is a mismatch between the operating system and Java's stream-based I/O model. The operating system moves data in large blocks (buffers), often with the help of hardware direct memory access (DMA), while the JVM's I/O classes prefer to work on small pieces of data — single bytes, a few lines of text. As a result, the operating system hands over whole buffers of data, and the java.io stream classes then spend a great deal of time chopping them into small pieces, often copying each piece through several layers of objects. The operating system delivers data by the truckload; the java.io classes process it one shovelful at a time. With NIO, you can easily move a truckload of data straight into a place you can use directly (a ByteBuffer object).
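To make the buffer-oriented model concrete, here is a minimal standalone sketch (not part of the crawler below; the file path is a made-up example) of reading a file through a FileChannel into a ByteBuffer in one bulk transfer instead of pulling it through a stream piece by piece:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class NioReadSketch {
    public static void main(String[] args) throws IOException {
        // Ask the OS to fill a whole buffer at once instead of feeding the
        // data through a stream one small read at a time.
        try (FileChannel channel = FileChannel.open(
                Paths.get("d:/download/sample.bin"), StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocate((int) channel.size());
            while (buffer.hasRemaining() && channel.read(buffer) != -1) {
                // keep reading until the buffer is full or end of file is reached
            }
            buffer.flip(); // switch the buffer from writing mode to reading mode
            System.out.println("bytes read: " + buffer.remaining());
        }
    }
}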
The code below uses Java NIO to make file reads and writes more efficient; for the differences between Java NIO and the classic stream I/O, see the book Java NIO.
XPath is used to extract the content of specific DOM nodes from the pages; a standalone sketch of that technique follows.
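As an illustration (the XML fragment here is made up, not taken from the crawled site), evaluating an XPath expression against an in-memory fragment looks like this:

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import java.io.StringReader;

public class XPathSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical fragment standing in for a page's table markup.
        String xml = "<root><tbody><tr><td>name</td><td>value</td></tr></tbody></root>";
        XPath xPath = XPathFactory.newInstance().newXPath();
        // Select every <td> cell and print its text content.
        NodeList cells = (NodeList) xPath.evaluate("/root/tbody/tr/td",
                new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        for (int i = 0; i < cells.getLength(); i++) {
            System.out.println(cells.item(i).getTextContent());
        }
    }
}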
The full crawler code is as follows (tested and working; adjust the endpoints of the site being crawled before use):
import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.common.collect.ImmutableMap;
import com.google.common.io.ByteStreams;
import lombok.extern.slf4j.Slf4j;
import org.apache.commons.collections4.MapUtils;
import org.apache.commons.lang3.RegExUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.NameValuePair;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.ssl.SSLContextBuilder;
import org.apache.http.util.EntityUtils;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import javax.net.ssl.SSLContext;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

@Slf4j
public class Main {
private static Pattern pattern = Pattern.compile("<div class=\"frame_subhead\">[\\s\\S]*?</div>");
private static Pattern pattern2 = Pattern.compile("<table class=\"form_table\">[\\s\\S]*?</table>");
String base = "http://172.16.3.122:9000";
Set<String> attachments = new LinkedHashSet<>();

public static void main(String[] args) throws IOException {
Main bootstrap = new Main();
log.info("Crawler started...");
String jsonArray = bootstrap.getResponseBody("http://172.16.3.122:9000/pdfs");
bootstrap.attachments.addAll(
bootstrap.getAttachments(jsonArray)
);
log.info("Logging in to obtain the session cookie...");
boolean succeed = bootstrap.login("admin", "123456");
if (succeed) {
List<String> sites = bootstrap.list();
sites.forEach(site -> {
try {
bootstrap.crawl(site);
} catch (IOException | XPathExpressionException e) {
log.error("出错网站:{},{}",site,e);
System.err.println("出错网站:"+site);
e.printStackTrace();
}
});
}
}

List<String> getAttachments(String rawArray) throws IOException {
ObjectMapper objectMapper = new ObjectMapper();
String[] attachments = objectMapper.readValue(rawArray, String[].class);
return Arrays.asList(attachments);
}

void crawl(String site) throws IOException, XPathExpressionException {
Path path = Paths.get(String.format("d:/download/%s", site));
char[] chars = path.getFileName().toString().toCharArray();
// skip directories whose name starts with an em space (Unicode code point 8195)
if ((int) chars[0] != 8195) {
if (!Files.exists(path)) {
Files.createDirectories(path);
}
downloadAttachment(site);
List<Integer> array = Arrays.asList(3, 5, 6, 7, 8);
for (int i = 1; i <= 11; i++) {
// e.g. /昆山吉山会津塑料工业股份有限公司/html/1.html
String url = String.format("%s/%s/html/%d.html", base, site, i);
String html = getResponseBody(url);
if (StringUtils.isNotBlank(html)) {
String pattern = "tabContent_" + i;
int start = html.indexOf(pattern);
String title = extractSubTitle(start, html);
Path file = Paths.get(String.format("d:/download/%s/%s.txt", site, title));
if (array.contains(i)) {
saveFile(start, file, html);
}
}
}
}
}

void xQuery(String text, Path path) throws IOException, XPathExpressionException {
String xml = text.substring(text.indexOf("<tbody>"));
StringJoiner joiner = new StringJoiner("", "<root>", "</root>");
InputSource inputXML = new InputSource(new StringReader(joiner.add(xml).toString()));
XPath xPath = XPathFactory.newInstance().newXPath();
NodeList tBodyNodes = (NodeList) xPath.evaluate("/root/tbody", inputXML, XPathConstants.NODESET);
try (BufferedWriter writer = Files.newBufferedWriter(path, Charset.defaultCharset(), StandardOpenOption.CREATE)) {
for (int i = 0; i < tBodyNodes.getLength(); i++) {
Node node = tBodyNodes.item(i);
NodeList trNodes = (NodeList) xPath.evaluate("tr", node, XPathConstants.NODESET);
for (int k = 0; k < trNodes.getLength(); k++) {
NodeList childList = (NodeList) xPath.evaluate("td", trNodes.item(k), XPathConstants.NODESET);
for (int j = 0; j < childList.getLength(); j++) {
Node child = childList.item(j);
String content = child.getTextContent();
writer.write(content);
if (j < childList.getLength() - 1) {
writer.write("\t");
}
}
writer.write("\r\n");
}
writer.write("\r\n");
}
}
}

void saveFile(int start, Path path, String html) throws XPathExpressionException, IOException {
Matcher matcher = pattern2.matcher(html);
int step = 0;
String tableText = "";
while (step++ < 1 && matcher.find(start)) {
tableText = RegExUtils.replacePattern(matcher.group(), "<table class=\"form_table\">|</table>", "").trim();
}
xQuery(tableText,path);
}

void downloadAttachment(String site) {
List<String> list = attachments.stream().filter(name -> name.startsWith(site)).collect(Collectors.toList());
list.forEach(name -> {
String filename = name.substring(name.lastIndexOf("/") + 1);
log.info("Downloading --{} - attachment: {}", site, filename);
String url = base + "/" + name;
String dest = "d:/download/" + site + "/" + filename;
Path file = Paths.get(dest).toAbsolutePath().normalize();
if (!Files.exists(file)) {
Path path = file.getParent();
if (!Files.exists(path)) {
log.info("首次下载,正在创建目录:{}",path);
try {
Files.createDirectories(path);
} catch (IOException e) {
log.error("目录创建失败:{}",e);
}
}
log.info("Saving crawled attachment to: {}", file);
try (FileChannel fc = new FileOutputStream(dest).getChannel()) {
ByteBuffer buffer = getResponseAttachment(url);
fc.write(buffer);
log.info("文件{}已经成功保存",file);
} catch (IOException e) {
log.error("文件{}保存出错:{}",file,e);
}
}
});
}

List<String> list() throws IOException {
String url = base + "/%E5%88%97%E8%A1%A8%E9%A1%B5%E9%9D%A2/%E6%B1%9F%E8%8B%8F%E7%9C%81%E9%AB%98%E6%96%B0%E6%8A%80%E6%9C%AF%E4%BC%81%E4%B8%9A%E8%BE%85%E5%8A%A9%E6%9D%90%E6%96%99%E6%8F%90%E4%BA%A4%E7%B3%BB%E7%BB%9F_files/Dir_Main.html";
return Files.list(Paths.get("E:\\pdf"))
.map(path -> path.getFileName().toFile().getName())
.filter(path -> (!path.startsWith(" ")) && !path.startsWith(" "))
.filter(dirName -> {
return !Arrays.asList("登录网页", "列表页面").contains(dirName);
}).collect(Collectors.toList());
}

boolean login(String username, String password) {
String url = base + "/index.html";
ImmutableMap<String, String> map = ImmutableMap.<String, String>builder()
.put("username", "admin")
.put("password", "123456")
.build();
try {
HttpResponse response = doPost(url, null, map);
return true;
} catch (IOException e) {
log.error("登录出错:{}", e);
;
return false;
}
}

/**
* Build an HttpClient that trusts all SSL certificates.
*
* @return a CloseableHttpClient configured to skip certificate validation
*/
public CloseableHttpClient buildDefaultHttpClientTrustSSL() {
SSLContext sslContext = null;
try {
sslContext = SSLContextBuilder.create().useProtocol(SSLConnectionSocketFactory.SSL).loadTrustMaterial((x, y) -> true).build();
} catch (Exception e) {
e.printStackTrace();
}
RequestConfig config = RequestConfig.custom()
.setSocketTimeout(30000)
.setConnectTimeout(30000)
.setConnectionRequestTimeout(30000)
.setContentCompressionEnabled(true)
.build();
return HttpClientBuilder.create().setDefaultRequestConfig(config).setSSLContext(sslContext).setSSLHostnameVerifier((x, y) -> true).build();
}

/***
* Extract the section sub-title from the response body.
* @param start offset to start matching from
* @param responseBody the page HTML
* @return the sub-title text
*/
public String extractSubTitle(int start, String responseBody) {
Matcher matcher = pattern.matcher(responseBody);
int i = 0;
String subHead = "";
while (i++ < 1 && matcher.find(start)) {
subHead = StringUtils.replacePattern(matcher.group(), "<div class=\"frame_subhead\">|</div>", "").trim();
}
int offset1 = subHead.indexOf("、");
if (offset1 >= 0) {
subHead = subHead.substring(offset1 + 1);
}
return subHead;
}

public String extract(String body, String pattern) {
Pattern regex = Pattern.compile(pattern);
return "";
}

HttpResponse doGet(String url, Map<String, String> headerRefs) throws IOException {
// use the client that trusts all certificates
CloseableHttpClient httpClient = buildDefaultHttpClientTrustSSL();
HttpGet httpGet = new HttpGet(url);
httpGet.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64)spider");
httpGet.addHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8");
httpGet.addHeader("Accept-Encoding", "gzip, deflate");
httpGet.addHeader("Accept-Language", "zh-CN,zh;q=0.9"); if (MapUtils.isNotEmpty(headerRefs)) {
for (Map.Entry<String, String> entry : headerRefs.entrySet()) {
String name = entry.getKey();
String value = entry.getValue();
httpGet.setHeader(name, value);
}
}
return httpClient.execute(httpGet);
}

HttpResponse doPost(String url, Map<String, String> headerRefs, Map<String, String> data) throws IOException {
// use the client that trusts all certificates
CloseableHttpClient httpClient = buildDefaultHttpClientTrustSSL();
HttpPost httpPost = new HttpPost(url);
httpPost.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64) spider");
httpPost.addHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8");
httpPost.addHeader("Accept-Encoding", "gzip, deflate");
httpPost.addHeader("Accept-Language", "zh-CN,zh;q=0.9"); if (MapUtils.isNotEmpty(headerRefs)) {
for (Map.Entry<String, String> entry : headerRefs.entrySet()) {
String name = entry.getKey();
String value = entry.getValue();
httpPost.setHeader(name, value);
}
}
if (MapUtils.isNotEmpty(data)) {
List<NameValuePair> nvps = new ArrayList<>();
for (Map.Entry<String, String> entry : data.entrySet()) {
String name = entry.getKey();
String value = entry.getValue();
nvps.add(new BasicNameValuePair(name, value));
}
httpPost.setEntity(new UrlEncodedFormEntity(nvps));
}
return httpClient.execute(httpPost);
}

/***
* Download an attachment.
* @param url
* @param headerRefs
* @return
* @throws IOException
*/
ByteBuffer getResponseAttachment(String url, Map<String, String> headerRefs) throws IOException {
HttpResponse response = doGet(url, headerRefs);
HttpEntity entity = response.getEntity();
if (entity != null) {
try (InputStream responseStream = entity.getContent()) {
byte[] targetArray = ByteStreams.toByteArray(responseStream);
ByteBuffer bufferByte = ByteBuffer.wrap(targetArray);
return bufferByte;
}
}
return ByteBuffer.wrap(new byte[0]);
}

ByteBuffer getResponseAttachment(String url) throws IOException {
return getResponseAttachment(url, null);
}

/***
* Fetch the HTML response body (the page source).
* @param url
* @param headerRefs
* @param charset
* @return
* @throws IOException
*/
String getResponseBody(String url, Map<String, String> headerRefs, Charset charset) throws IOException {
HttpResponse response = doGet(url, headerRefs);
int status = response.getStatusLine().getStatusCode();
if (status != 200) {
return "";
}
HttpEntity entity = response.getEntity();
if (entity != null) {
return EntityUtils.toString(entity, charset);
}
return "";
}

String getResponseBody(String url, Charset charset) throws IOException {
return getResponseBody(url, null, charset);
}

String getResponseBody(String url) throws IOException {
return getResponseBody(url, null, StandardCharsets.UTF_8);
}
}