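Crawler Task 1: Using HttpClient to scrape news titles and URLs from the Baidu News homepage (UTF-8 encoding)

This is the first small crawler task I worked on.

Maven project (pom.xml):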

<project xmlns="http://maven.apache.org/POM/4.0.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.zhaowu</groupId>
    <artifactId>pachong01</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient -->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.3</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.11.2</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.6</version>
        </dependency>
    </dependencies>
</project>

Implementation:

package com.zhaowu.renwu1;

import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class News {

    public static void main(String[] args) throws ClientProtocolException, IOException {
        // Create the HttpGet request
        HttpGet httpGet = new HttpGet("https://news.baidu.com/");
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(10000) // connect timeout: 10 seconds (value is in milliseconds)
                .setSocketTimeout(10000)  // read timeout: 10 seconds
                .build();
        httpGet.setConfig(config);
        // Set the User-Agent request header to mimic a browser
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; W…) Gecko/20100101 Firefox/59.0");

        // Create the HttpClient instance and execute the GET request.
        // try-with-resources closes the response and the client even if an exception is thrown.
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            // Get the response entity
            HttpEntity entity = response.getEntity();
            // Read the entity body as a string, decoding it as UTF-8
            String content = EntityUtils.toString(entity, "utf-8");
            // System.out.println("Page content: " + content);

            // Parse the page into a Document object
            Document doc = Jsoup.parse(content);
            // Select every <a> element that has an href attribute
            Elements hrefElements = doc.select("a[href]");
            for (Element e : hrefElements) {
                System.out.println("News title: " + e.text());
                System.out.println("News URL: " + e.attr("href"));
                System.out.println("------------------------");
            }
        }
    }
}
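
The loop in News prints every href exactly as it appears in the page, so relative paths, in-page anchors and javascript: pseudo-links would all be reported as news URLs. Below is a minimal sketch of one way to clean them up, using only the JDK's java.net.URI. The names LinkCleaner and toAbsoluteUrl are hypothetical and not part of the original code, and the sample hrefs in main are made up for illustration rather than taken from the real page.

```java
import java.net.URI;

// Hypothetical helper, not part of the original post.
class LinkCleaner {

    /**
     * Resolves an href against the URL of the page it was found on.
     * Returns null for hrefs that do not point to another document
     * (empty values, "#..." anchors, "javascript:" pseudo-links)
     * or that java.net.URI cannot parse.
     */
    static String toAbsoluteUrl(String pageUrl, String href) {
        if (href == null) {
            return null;
        }
        String trimmed = href.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("#")) {
            return null;
        }
        // Case-insensitive check for the "javascript:" prefix
        if (trimmed.regionMatches(true, 0, "javascript:", 0, "javascript:".length())) {
            return null;
        }
        try {
            // URI.resolve handles absolute, scheme-relative ("//host/..."),
            // root-relative ("/path") and plain relative references
            return URI.create(pageUrl).resolve(trimmed).toString();
        } catch (IllegalArgumentException ex) {
            // Malformed href, for example one containing a raw space
            return null;
        }
    }

    public static void main(String[] args) {
        String page = "https://news.baidu.com/";
        // Made-up sample hrefs covering the cases handled above
        String[] samples = { "/guonei", "//www.baidu.com/x", "#top", "javascript:void(0);" };
        for (String href : samples) {
            System.out.println(href + " -> " + toAbsoluteUrl(page, href));
        }
    }
}
```

Inside the crawler loop, a link could then be skipped whenever toAbsoluteUrl returns null or e.text() is empty. jsoup can also do the resolution itself: if the page URL is passed as the second argument to Jsoup.parse, e.absUrl("href") returns the absolute form. The sketch above just keeps the logic visible and free of dependencies.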
