Python crawler

Crawling a blog website: www.apress.com

# -*- coding: utf-8 -*-
"""
Created on Wed May 10 18:01:41 2017

@author: Raghav Bali

This script crawls apress.com's blog page to:
    + extract list of recent blog post titles and their URLs
    + extract content related to each blog post in plain text

using requests and BeautifulSoup packages

``Execute``
        $ python crawl_bs.py
"""

import requests
from time import sleep
from bs4 import BeautifulSoup

def get_post_mapping(content):
    """Extract blog post titles and URLs from the blog listing page.

    Args:
        content (bytes): Raw page body, i.e. ``response.content`` as
            returned by ``requests.get``

    Returns:
        list: a list of dictionaries with keys ``title`` and ``url``
    """
    post_detail_list = []
    post_soup = BeautifulSoup(content, "lxml")
    for h3 in post_soup.find_all("h3"):
        # Skip headings that do not wrap a link; h3.a is None for those
        if h3.a is None:
            continue
        post_detail_list.append(
            {'title': h3.a.get_text(), 'url': h3.a.attrs.get('href')}
        )
    return post_detail_list

def get_post_content(content):
    """Extract the body of a single blog post as plain text.

    Args:
        content (bytes): Raw page body, i.e. ``response.content`` as
            returned by ``requests.get``

    Returns:
        str: the blog post's content in plain text, or an empty string
            if the expected container is not found
    """
    text_soup = BeautifulSoup(content, "lxml")
    para_list = text_soup.find_all("div", {'class': 'cms-richtext'})
    # Guard against pages without a cms-richtext container (avoids IndexError)
    if not para_list:
        return ""
    # Only the first matching container holds the post body
    return para_list[0].get_text()

if __name__ == '__main__':
    crawl_url = "http://www.apress.com/in/blog/all-blog-posts"
    post_url_prefix = "http://www.apress.com"

    print("Crawling Apress.com for recent blog posts...\n\n")
    response = requests.get(crawl_url)

    # Start with an empty list so a failed request does not raise a NameError
    blog_post_details = []
    if response.status_code == 200:
        blog_post_details = get_post_mapping(response.content)

    if blog_post_details:
        print("Blog posts found:{}".format(len(blog_post_details)))

        for post in blog_post_details:
            # Skip entries whose link carries no href attribute
            if not post.get('url'):
                continue
            print("Crawling content for post titled:", post.get('title'))
            post_response = requests.get(post_url_prefix + post.get('url'))
            if post_response.status_code == 200:
                post['content'] = get_post_content(post_response.content)

            # Pause between requests to avoid hammering the server
            print("Waiting for 10 secs before crawling next post...\n\n")
            sleep(10)

        print("Content crawled for all posts")

        # print/write content to file
        for post in blog_post_details:
            print(post)
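
The script stops at printing each post dictionary. As a minimal sketch of the "write content to file" step named in its last comment, the list of posts could be serialized to JSON with the standard library alone. The helper names `save_posts` and `load_posts` and the default file name `apress_posts.json` are assumptions made for illustration; they are not part of the original script.

```python
import json


def save_posts(posts, path="apress_posts.json"):
    """Write a list of post dictionaries to a UTF-8 JSON file.

    Hypothetical helper: the name and default path are illustrative only.
    """
    with open(path, "w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps non-ASCII characters readable in the file
        json.dump(posts, handle, ensure_ascii=False, indent=2)
    return path


def load_posts(path="apress_posts.json"):
    """Read the post dictionaries back from the JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

With these in place, the final loop in the script could be replaced by a single call such as `save_posts(blog_post_details)`. JSON is used here because each post is already a plain dictionary of strings; a CSV file or a database would work equally well.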
