JAVA爬虫挖取CSDN博客文章

开门见山，看看这个教程的主要任务，就去csdn博客，挖取技术文章，我以《第一行代码–安卓》的作者为例，将他在csdn发表的额博客信息都挖取出来。因为郭神是我在大学期间比较崇拜的对象之一。他的csdn首页如下:http://blog.csdn.net/guolin_blog,首页如图:

你需要掌握的技术有：java se，正则表达式，js dom编程思想，jsoup，此外还需要http协议的一些知识。其中其他技术点可能你以前就掌握了，只差一个jsoup了，这个哥们是干嘛使的呢？我用一句话来说，就是Java程序获取html之后，向js,jquery一样来解析dom节点的，很多语法与js/jquery十分的相似，比如getElementById,getElemementsByClass等等语法与js相似极了，此外jsoup还有select选择器功能。所以，这里只要稍微掌握jsoup语法就可以像js操作dom一样的用Java来操作请求到的html网页了。jsoup的官方教程地址:http://www.open-open.com/jsoup/。

开始之前，你应该有一定的工具的，不如已经有一款熟练的ide来调试和查看变量，有一个web调试工具，如火狐的firebug之类的，总之，就是有一个java web程序员日常开发和调试使用的工具就差不多了。

第一步：新建一个maven项目。这个maven项目可以是一个java se项目，选择jar包就行了。重点是引入jsoup所需要的jar包。pom.xml配置如下:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.shizongger.jsoup</groupId>
  <artifactId>jsoup-shizongger</artifactId>
  <version>0.0.1-SNAPSHOT</version>

  <dependencies>
    <dependency>
      <!-- jsoup HTML parser library @ http://jsoup.org/ -->
      <groupId>org.jsoup</groupId>
      <artifactId>jsoup</artifactId>
      <version>1.9.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.1</version>
    </dependency>
    <dependency>
        <groupId>org.mockito</groupId>
        <artifactId>mockito-all</artifactId>
        <version>1.8.4</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>1.6</source>
          <target>1.6</target>
        </configuration>
      </plugin>
    </plugins>
</build>
</project>

第二步：新建一个带有main方法的类，当然你也可以不带main方法，不过我只是写一个java se小程序而已。(如果是配置了http代理的，按照这一步：配置http代理，这点很重要，不然请求不到html网页的，我们都知道，我们的浏览器都配置了代理了的，不然根本无法请求到html网页，不信你可以把IE浏览器的代理取消，去浏览百度一下看看)。设置http代理的代码如下:

        System.setProperty("http.maxRedirects", "50");
        System.getProperties().setProperty("proxySet", "true");
        // 如果不设置，只要代理IP和代理端口正确,此项不设置也可以
        String ip = "代理服务器地址";
        System.getProperties().setProperty("http.proxyHost", ip);
        System.getProperties().setProperty("http.proxyPort", "代理的端口");

上面ip是代理的ip地址，端口8080是代理的端口，参考的地方就是自己去IE浏览器的配置查看。

第三步：好了，如果是一般的网站，直接可以发送post或者get请求了，但是csdn还需要在http头发送几个信息，反正之前我没有发送这些头部信息，是请求不到的。用firebug分析数据，如图：

    private static final String URL = "http://blog.csdn.net/guolin_blog";

    Connection conn = Jsoup.connect(URL)
        .userAgent("Mozilla/5.0 (Windows NT 6.1; rv:47.0) Gecko/20100101                                Firefox/47.")
        .timeout(30000)
        .method(Connection.Method.GET);

第三步：到这里，可以得到一个Connection了，接着就可以获取一个html文档了。

Document doc = conn.get();

如图，html的内容已经加载进来了。

第四步：用jsoup获取所得到的dom节点，有必须时，还需要用到正则表达式，不过本篇博客不会用到正则。其他任务可能会用到，总之，java爬虫应该经常用到正则表达式的。

我用firebug看到所有的文章，都是显示在<div id="article_list" class="list"></div>内，而每篇文章都是在这个div下面的这个div:<div class="list_item article_item">,

每篇文章占据的div，完整的html元素如下:

<div class="list_item article_item">
        <div class="article_title">
         <span class="ico ico_type_Original"></span>

    <h1>
        <span class="link_title"><a href="/guolin_blog/article/details/51336415">
        Android提醒微技巧，你真的了解Dialog、Toast和Snackbar吗？
        </a></span>
    </h1>
</div>

        <div class="article_description">
Dialog和Toast所有人肯定都不会陌生的，这个我们平时用的实在是太多了。而Snackbar是Design Support库中提供的新控件，有些朋友可能已经用过了，有些朋友可能还没去了解。但是你真的知道什么时候应该使用Dialog，什么时候应该使用Toast，什么时候应该使用Snackbar吗？本篇文章中我们就来学习一下这三者使用的时机，另外还会介绍一些额外的技巧...        </div>
            <div class="article_manage">
             <span class="link_postdate">2016-07-26 07:55</span>

        <span class="link_view" title="阅读次数"><a href="/guolin_blog/article/details/51336415" title="阅读次数">阅读</a>(7458)</span>
        <span class="link_comments" title="评论次数"><a href="/guolin_blog/article/details/51336415#comments" title="评论次数" onclick="_gaq.push(['_trackEvent','function', 'onclick', 'blog_articles_pinglun'])">评论</a>(45)</span>

    </div>

        <div class="clear"></div>
    </div>

仔细分析一下，这个div中涵盖了文章的简介，阅读次数，连接地址等等，总之，这个div才是重头戏要获取的数据都在这呢。

    public static List<ArticleEntity> getArtitcleByPage(int pageNow) throws IOException{

        Connection conn = Jsoup.connect(URL + "/article/list/" + pageNow)
                .userAgent("Mozilla/5.0 (Windows NT 6.1; rv:47.0) Gecko/20100101 Firefox/47.")
                .timeout(30000)
                .method(Connection.Method.GET);
        Document doc = conn.get();
        Element body = doc.body();
        List<ArticleEntity> resultList = new ArrayList<ArticleEntity>();

        Element articleListDiv = body.getElementById("article_list");
        Elements articleList = articleListDiv.getElementsByClass("list_item");
        for(Element article : articleList){
            ArticleEntity articleEntity = new ArticleEntity();
            Element linkNode = (article.select("div h1 a")).get(0);
            Element desptionNode = (article.getElementsByClass("article_description")).get(0);
            Element articleManageNode = (article.getElementsByClass("article_manage")).get(0);

            articleEntity.setAddress(linkNode.attr("href"));
            articleEntity.setTitle(linkNode.text());
            articleEntity.setDesption(desptionNode.text());
            articleEntity.setTime(articleManageNode.select("span:eq(0").text());

            resultList.add(articleEntity);
        }
        return resultList;
    }

现在可以将当前页数的文章挖掘出来了，但是郭神的技术文章不止一页啊。就是传说中的，还要进行分页挖掘，首先我们来看看我们的Java是如何做分页的。虚算法如下：

分页存储过程或者页面分页中的分页算法:

int pagesize // 每页记录数

int recordcount // 总记录数

int pagecount // 总页数

pagecount=(recordcount+pagesize-1)/pagesize

此方法得出的结果为实际页码

pagecount=(recordcount-1)/pagesize

此方法得出的结果为实际页码-1

注:两个整数相除的结果始终是整数。例如：5除以2的结果为2。若要确定5除以2的余数，请使用modulo运算符(%)。若要获取作为有理数或分数的商，应将被除数或除数设置为float或double型。可以通过在数字后添加一个小数点来隐式执行此操作

那么，我们先来获取总记录数，总页数，每页有多少技术文章吧！好在csdn的信息都放在地下的div里面了呢，来来来，我们用firebug看一看。

<div id="papelist" class="pagelist">
<span> 94条  共7页</span><strong>1</strong> <a href="/sinyu890807/article/list/2">2</a> <a href="/sinyu890807/article/list/3">3</a> <a href="/sinyu890807/article/list/4">4</a> <a href="/sinyu890807/article/list/5">5</a> <a href="/sinyu890807/article/list/6">...</a> <a href="/sinyu890807/article/list/2">下一页</a> <a href="/sinyu890807/article/list/7">尾页</a>
    </div>

代码如下：

    private int totalPage;
    private int nowPage;

    /**
     *  despt:分页算法
     * @return
     * @throws IOException
     */
    public List<ArticleEntity> goPage() throws IOException{
        Connection conn = Jsoup.connect(URL)
                .userAgent("Mozilla/5.0 (Windows NT 6.1; rv:47.0) Gecko/20100101 Firefox/47.")
                .timeout(30000)
                .method(Connection.Method.GET);
        Document doc = conn.get();
        Element body = doc.body();

        //获取总页数
        String totalPage = body.getElementById("papelist").select("span:eq(0)").text();
        String regex = ".+共(\\d+)页";
        totalPage = totalPage.replaceAll(regex, "$1");
        this.totalPage = Integer.parseInt(totalPage);
        this.nowPage = 1;

        List<ArticleEntity> list = new ArrayList<ArticleEntity>();
        for(nowPage = 1; this.nowPage <= this.totalPage; this.nowPage++){
            list.addAll(getArtitcleByPage(this.nowPage));
        }
        return list;
    }

至此，就可以将郭霖的csdn技术博客都可以获取了。此时你只需要将得到的信息都封装好，在需要的时候调用就行了。

完整代码如下：

public class ArticleEntity {
    /**
     * ���µ�ַ����Ե�ַ
     */
    private String address;

    /**
     * ���±���
     */
    private String title;

    /**
     * ��������
     */
    private String desption;

    /**
     * ����ʱ��
     */
    private String time;

    public String getAddress() {
        return address;
    }

    public void setAddress(String address) {
        this.address = address;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getDesption() {
        return desption;
    }

    public void setDesption(String desption) {
        this.desption = desption;
    }

    public String getTime() {
        return time;
    }

    public void setTime(String time) {
        this.time = time;
    }
}

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;

import com.shizongger.entity.ArticleEntity;

public class MyBlog {
    private static final String URL = "http://blog.csdn.net/guolin_blog";
    private int totalPage;
    private int nowPage;

    public static void main(String[] args) {
        System.setProperty("http.maxRedirects", "50");
        System.getProperties().setProperty("proxySet", "true");
        // 如果不设置，只要代理IP和代理端口正确,此项不设置也可以
        String ip = "172.17.18.80";
        System.getProperties().setProperty("http.proxyHost", ip);
        System.getProperties().setProperty("http.proxyPort", "8080");

        MyBlog demo = new MyBlog();
        try {
            List<ArticleEntity> list = demo.goPage();
            for(ArticleEntity tmp : list) {
                System.out.println("文章地址:" + tmp.getAddress());
                System.out.println("文章标题:" + tmp.getTitle());
                System.out.println("文章简介:" + tmp.getDesption());
                System.out.println("发表时间:" + tmp.getTime());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     *  despt:分页算法
     * @return
     * @throws IOException
     */
    public List<ArticleEntity> goPage() throws IOException{
        Connection conn = Jsoup.connect(URL)
                .userAgent("Mozilla/5.0 (Windows NT 6.1; rv:47.0) Gecko/20100101 Firefox/47.")
                .timeout(30000)
                .method(Connection.Method.GET);
        Document doc = conn.get();
        Element body = doc.body();

        //获取总页数
        String totalPage = body.getElementById("papelist").select("span:eq(0)").text();
        String regex = ".+共(\\d+)页";
        totalPage = totalPage.replaceAll(regex, "$1");
        this.totalPage = Integer.parseInt(totalPage);
        this.nowPage = 1;

        List<ArticleEntity> list = new ArrayList<ArticleEntity>();
        for(nowPage = 1; this.nowPage <= this.totalPage; this.nowPage++){
            list.addAll(getArtitcleByPage(this.nowPage));
        }
        return list;
    }

    public static List<ArticleEntity> getArtitcleByPage(int pageNow) throws IOException{

        Connection conn = Jsoup.connect(URL + "/article/list/" + pageNow)
                .userAgent("Mozilla/5.0 (Windows NT 6.1; rv:47.0) Gecko/20100101 Firefox/47.")
                .timeout(30000)
                .method(Connection.Method.GET);
        Document doc = conn.get();
        Element body = doc.body();
        List<ArticleEntity> resultList = new ArrayList<ArticleEntity>();

        Element articleListDiv = body.getElementById("article_list");
        Elements articleList = articleListDiv.getElementsByClass("list_item");
        for(Element article : articleList){
            ArticleEntity articleEntity = new ArticleEntity();
            Element linkNode = (article.select("div h1 a")).get(0);
            Element desptionNode = (article.getElementsByClass("article_description")).get(0);
            Element articleManageNode = (article.getElementsByClass("article_manage")).get(0);

            articleEntity.setAddress(linkNode.attr("href"));
            articleEntity.setTitle(linkNode.text());
            articleEntity.setDesption(desptionNode.text());
            articleEntity.setTime(articleManageNode.select("span:eq(0").text());

            resultList.add(articleEntity);
        }
        return resultList;
    }
}

在控制台已经输出想要的信息，这里截取一段吧，

文章地址:/guolin_blog/article/details/51336415
文章标题:Android提醒微技巧，你真的了解Dialog、Toast和Snackbar吗？
文章简介:Dialog和Toast所有人肯定都不会陌生的，这个我们平时用的实在是太多了。而Snackbar是Design Support库中提供的新控件，有些朋友可能已经用过了，有些朋友可能还没去了解。但是你真的知道什么时候应该使用Dialog，什么时候应该使用Toast，什么时候应该使用Snackbar吗？本篇文章中我们就来学习一下这三者使用的时机，另外还会介绍一些额外的技巧...
发表时间:2016-07-26 07:55

JAVA爬虫挖取CSDN博客文章的更多相关文章

Python爬虫简单实现CSDN博客文章标题列表
Python爬虫简单实现CSDN博客文章标题列表操作步骤: 分析接口,怎么获取数据? 模拟接口,尝试提取数据封装接口函数,实现函数调用. 1.分析接口打开Chrome浏览器,开启开发者工具(F1 ...
Python爬取CSDN博客文章
0 url :http://blog.csdn.net/youyou1543724847/article/details/52818339Redis一点基础的东西目录 1.基础底层数据结构 2.win ...
Hello Python!用 Python 写一个抓取 CSDN 博客文章的简单爬虫
网络上一提到 Python,总会有一些不知道是黑还是粉的人大喊着:Python 是世界上最好的语言.最近利用业余时间体验了下 Python 语言,并写了个爬虫爬取我 csdn 上关注的几个大神的博客, ...
Python爬虫抓取csdn博客
昨天晚上为了下载保存某位csdn大牛的所有博文,写了一个爬虫来自己主动抓取文章并保存到txt文本,当然也能够保存到html网页中. 这样就能够不用Ctrl+C 和Ctrl+V了,很方便.抓取别的站点 ...
python 爬虫爬取序列博客文章列表
python中写个爬虫真是太简单了 import urllib.request from pyquery import PyQuery as PQ # 根据URL获取内容并解码为UTF-8 def g ...
利用爬虫爬取指定用户的CSDN博客文章转为md格式，目的是完成博客迁移博文到Hexo等静态博客
文章目录功能爬取的方式: 设置生成的md文件命名规则: 设置md文件的头部信息是否显示csdn中的锚点"文章目录"字样,以及下面具体的锚点默认false(因为csdn中是集 ...
windows下使用python的scrapy爬虫框架，爬取个人博客文章内容信息
scrapy作为流行的python爬虫框架,简单易用,这里简单介绍如何使用该爬虫框架爬取个人博客信息.关于python的安装和scrapy的安装配置请读者自行查阅相关资料,或者也可以关注我后续的内容. ...
Python 爬取CSDN博客频道
初次接触python,写的很简单,开发工具PyCharm,python 3.4很方便 python 部分模块安装时需要其他的附属模块之类的,可以先 pip install wheel 然后可以直接下载 ...
CSDN博客文章的备份及导出电子书CHM
需要用到的工具集合下载:http://download.csdn.net/source/2881423 在CSDN.百度等写博客文章的应该很多,很多时候担心服务器有一天突然挂了,或者担心自己的号被封了 ...

随机推荐

20款时尚的 WordPress 企业模板【免费主题下载】
在这篇文章中,我们收集了20款时尚的 WordPress 企业模板.WordPress 作为最流行的博客系统,插件众多,易于扩充功能.安装和使用都非常方便,而且有许多第三方开发的免费模板,安装方式简单 ...
前端安全之XSS攻击
XSS(cross-site scripting跨域脚本攻击)攻击是最常见的Web攻击,其重点是“跨域”和“客户端执行”.有人将XSS攻击分为三种,分别是: 1. Reflected XSS(基于反射 ...
git 设置 key 到服务器，同步代码不需要输入用户名和密码
1 ssh-keygen -t rsa 2 vim ~/.ssh/id_rsa.pub 3. 添加到git 服务器,这样同步代码就不需要输入密码
Swift tour
输出函数: print(“hello world!") 无需引入函数库,无须使用“;”作为语句结尾,也无须写跟其它语言一样的main()函数,Swift中,全局区的代码就是程序入口.You ...
iOS 子视图超出父视图范围点击事件处理!
- (UIView *)hitTest:(CGPoint)point withEvent:(UIEvent *)event{ UIView *view = [super hitTest:point ...
【代码笔记】iOS-GTMBase64
一,工程文件. 二,代码. - (void)viewDidLoad { [super viewDidLoad]; // Do any additional setup after loading th ...
【读书笔记】iOS-使用Web Service-基于客户端服务器结构的网络通信（一）
Web Service技术是一种通过Web协议提供服务,保证不同平台的应用服务可以互操作,为客户端程序提供不同的服务. 目前3种主流的Web Service实现方案用:REST,SOAP和XML-RP ...
CSS 行内样式页内样式外部样式
行内标签: <!DOCTYPE html> <html> <head lang="en"> <meta charset="UTF ...
[Android] android studio 2.0即时运行功能探秘
即时运行instant Run是android studio 2中,开发人员最关心的特性之一在google发布studio 2.0之后,马上更新体验了一把,然而发现,并没快多少,说好的即时运行呢? ...
Asp.net禁用页面缓存的方法总结
1.在Asp页面首部<head>加入复制代码代码如下: Response.Buffer = True Response.ExpiresAbsolute = ...

JAVA爬虫挖取CSDN博客文章

JAVA爬虫挖取CSDN博客文章的更多相关文章

随机推荐

热门专题