How To Crawl A Web Page with Scrapy and Python 3
Introduction
Web scraping, often called web crawling or web spidering, or “programmatically going over a collection of web pages and extracting data,” is a powerful tool for working with data on the web.
With a web scraper, you can mine data about a set of products, get a large corpus of text or quantitative data to play around with, get data from a site without an official API, or just satisfy your own personal curiosity.
In this tutorial, you’ll learn about the fundamentals of the scraping and spidering process as you explore a playful data set. We'll use Brickset, a community-run site that contains information about LEGO sets. By the end of this tutorial, you’ll have a fully functional Python web scraper that walks through a series of pages on Brickset and extracts data about LEGO sets from each page, displaying the data to your screen.
The scraper will be easily expandable so you can tinker around with it and use it as a foundation for your own projects scraping data from the web.
Prerequisites
To complete this tutorial, you'll need a local development environment for Python 3. You can follow How To Install and Set Up a Local Programming Environment for Python 3 to configure everything you need.
Step 1 — Creating a Basic Scraper
Scraping is a two-step process:
- You systematically find and download web pages.
- You take those web pages and extract information from them.
Both of those steps can be implemented in a number of ways in many languages.
You can build a scraper from scratch using modules or libraries provided by your programming language, but then you have to deal with some potential headaches as your scraper grows more complex. For example, you'll need to handle concurrency so you can crawl more than one page at a time. You'll probably want to figure out how to transform your scraped data into different formats like CSV, XML, or JSON. And you'll sometimes have to deal with sites that require specific settings and access patterns.
You'll have better luck if you build your scraper on top of an existing library that handles those issues for you. For this tutorial, we're going to use Python and Scrapy to build our scraper.
Scrapy is one of the most popular and powerful Python scraping libraries; it takes a "batteries included" approach to scraping, meaning that it handles a lot of the common functionality that all scrapers need so developers don't have to reinvent the wheel each time. It makes scraping a quick and fun process!
Scrapy, like most Python packages, is available on PyPI. PyPI, the Python Package Index, is a community-owned repository of all published Python software, and pip is the tool that installs packages from it.
If you have a Python installation like the one outlined in the prerequisite for this tutorial, you already have pip installed on your machine, so you can install Scrapy with the following command:
- pip install scrapy
If you run into any issues with the installation, or you want to install Scrapy without using pip, check out the official installation docs.
With Scrapy installed, let's create a new folder for our project. You can do this in the terminal by running:
- mkdir brickset-scraper
Now, navigate into the new directory you just created:
- cd brickset-scraper
Then create a new Python file for our scraper called scraper.py. We'll place all of our code in this file for this tutorial. You can create this file in the terminal with the touch command, like this:
- touch scraper.py
Or you can create the file using your text editor or graphical file manager.
We'll start by making a very basic scraper that uses Scrapy as its foundation. To do that, we'll create a Python class that subclasses scrapy.Spider, a basic spider class provided by Scrapy. This class will have two required attributes:
- name — just a name for the spider.
- start_urls — a list of URLs that you start to crawl from. We'll start with one URL.
Open the scraper.py file in your text editor and add this code to create the basic spider:
import scrapy


class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"
    start_urls = ['http://brickset.com/sets/year-2016']
Let's break this down line by line:
First, we import scrapy so that we can use the classes that the package provides.
Next, we take the Spider class provided by Scrapy and make a subclass out of it called BrickSetSpider. Think of a subclass as a more specialized form of its parent class. The Spider class has methods and behaviors that define how to follow URLs and extract data from the pages it finds, but it doesn't know where to look or what data to look for. By subclassing it, we can give it that information.
Then we give the spider the name brickset_spider.
Finally, we give our scraper a single URL to start from: http://brickset.com/sets/year-2016. If you open that URL in your browser, it will take you to a search results page, showing the first of many pages containing LEGO sets.
Now let's test out the scraper. You typically run Python files by running a command like python path/to/file.py. However, Scrapy comes with its own command line interface to streamline the process of starting a scraper. Start your scraper with the following command:
- scrapy runspider scraper.py
You’ll see something like this:
2016-09-22 23:37:45 [scrapy] INFO: Scrapy 1.1.2 started (bot: scrapybot)
2016-09-22 23:37:45 [scrapy] INFO: Overridden settings: {}
2016-09-22 23:37:45 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-09-22 23:37:45 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
...
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-09-22 23:37:45 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
...
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-09-22 23:37:45 [scrapy] INFO: Enabled item pipelines:
[]
2016-09-22 23:37:45 [scrapy] INFO: Spider opened
2016-09-22 23:37:45 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-09-22 23:37:45 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-09-22 23:37:47 [scrapy] DEBUG: Crawled (200) <GET http://brickset.com/sets/year-2016> (referer: None)
2016-09-22 23:37:47 [scrapy] INFO: Closing spider (finished)
2016-09-22 23:37:47 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 224,
'downloader/request_count': 1,
...
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 9, 23, 6, 37, 45, 995167)}
2016-09-22 23:37:47 [scrapy] INFO: Spider closed (finished)
That's a lot of output, so let's break it down.
- The scraper initialized and loaded additional components and extensions it needed to handle reading data from URLs.
- It used the URL we provided in the start_urls list and grabbed the HTML, just like your web browser would do.
- It passed that HTML to the parse method, which doesn't do anything by default. Since we never wrote our own parse method, the spider just finishes without doing any work. (A throwaway parse you could use to inspect the response is sketched just below.)
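If you want to confirm for yourself that the page really was downloaded, you could temporarily override parse with something minimal. This is just a sketch for poking around, not part of the scraper we're building; it only logs a couple of facts about the response:

import scrapy


class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"
    start_urls = ['http://brickset.com/sets/year-2016']

    def parse(self, response):
        # Log the URL we fetched and the size of the HTML payload,
        # just to confirm the download step is working.
        self.logger.info("Fetched %s (%d bytes)", response.url, len(response.body))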
Now let's pull some data from the page.
Step 2 — Extracting Data from a Page
We've created a very basic program that pulls down a page, but it doesn't do any scraping or spidering yet. Let’s give it some data to extract.
If you look at the page we want to scrape, you'll see it has the following structure:
- There's a header that’s present on every page.
- There's some top-level search data, including the number of matches, what we’re searching for, and the breadcrumbs for the site.
- Then there are the sets themselves, displayed in what looks like a table or ordered list. Each set has a similar format.
When writing a scraper, it's a good idea to look at the source of the HTML file and familiarize yourself with the structure. So here it is, with some things removed for readability:
<body>
<section class="setlist">
<article class='set'>
<a class="highslide plain mainimg" href=
"http://images.brickset.com/sets/images/10251-1.jpg?201510121127"
onclick="return hs.expand(this)"><img src=
"http://images.brickset.com/sets/small/10251-1.jpg?201510121127"
title="10251-1: Brick Bank"></a>
<div class="highslide-caption">
<h1><a href='/sets/10251-1/Brick-Bank'>Brick Bank</a></h1>
<div class='tags floatleft'>
<a href='/sets/10251-1/Brick-Bank'>10251-1</a> <a href=
'/sets/theme-Advanced-Models'>Advanced Models</a> <a class=
'subtheme' href=
'/sets/theme-Advanced-Models/subtheme-Modular-Buildings'>Modular
Buildings</a> <a class='year' href=
'/sets/theme-Advanced-Models/year-2016'>2016</a>
</div>
<div class='floatright'>
©2016 LEGO Group
</div>
<div class="pn">
<a href="#" onclick="return hs.previous(this)" title=
"Previous (left arrow key)">« Previous</a> <a href="#"
onclick="return hs.next(this)" title=
"Next (right arrow key)">Next »</a>
</div>
</div>
...
</article>
<article class='set'>
...
</article>
</section>
</body>
Scraping this page is a two-step process:
- First, grab each LEGO set by looking for the parts of the page that have the data we want.
- Then, for each set, grab the data we want from it by pulling the data out of the HTML tags.
Scrapy grabs data based on selectors that you provide. Selectors are patterns we can use to find one or more elements on a page so we can then work with the data within the element. Scrapy supports either CSS selectors or XPath selectors.
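If you'd like to compare the two before committing to one, Scrapy ships with an interactive shell you can point at a page and experiment in. The snippet below is only a sketch, and the XPath expression is an assumption based on the article markup shown earlier:
- scrapy shell 'http://brickset.com/sets/year-2016'

# Inside the shell, these two queries should find the same set elements:
css_sets = response.css('.set')
xpath_sets = response.xpath('//article[contains(@class, "set")]')
print(len(css_sets), len(xpath_sets))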
We’ll use CSS selectors for now since CSS is the easier option and a perfect fit for finding all the sets on the page. If you look at the HTML for the page, you'll see that each set is specified with the class set. Since we're looking for a class, we'd use .set for our CSS selector. All we have to do is pass that selector into the response object, like this:
class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"
    start_urls = ['http://brickset.com/sets/year-2016']

    def parse(self, response):
        SET_SELECTOR = '.set'
        for brickset in response.css(SET_SELECTOR):
            pass
This code grabs all the sets on the page and loops over them to extract the data. Now let’s extract the data from those sets so we can display it.
Another look at the [source](view-source:brickset.com/sets/year-2016) of the page we're parsing tells us that the name of each set is stored within an a tag inside an h1 tag for each set:
<h1><a href='/sets/10251-1/Brick-Bank'>Brick Bank</a></h1>
The brickset object we’re looping over has its own css method, so we can pass in a selector to locate child elements. Modify your code as follows to locate the name of the set and display it:
class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"
    start_urls = ['http://brickset.com/sets/year-2016']

    def parse(self, response):
        SET_SELECTOR = '.set'
        for brickset in response.css(SET_SELECTOR):
            NAME_SELECTOR = 'h1 a ::text'
            yield {
                'name': brickset.css(NAME_SELECTOR).extract_first(),
            }
Note: The trailing comma after extract_first() isn't a typo. We're going to add more to this section soon, so we've left the comma there to make adding to this section easier later.
You’ll notice two things going on in this code:
- We append ::text to our selector for the name. That’s a CSS pseudo-selector that fetches the text inside of the a tag rather than the tag itself.
- We call extract_first() on the object returned by brickset.css(NAME_SELECTOR) because we just want the first element that matches the selector. This gives us a string, rather than a list of elements (the short illustration after this list shows the difference).
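To make the difference concrete, here's roughly what each call returns (the values are illustrative; they depend on what the page actually contains):

brickset.css('h1 a ::text').extract()        # a list of strings, e.g. ['Brick Bank']
brickset.css('h1 a ::text').extract_first()  # a single string, e.g. 'Brick Bank'
brickset.css('.no-such-class ::text').extract_first()  # None when nothing matches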
Save the file and run the scraper again:
- scrapy runspider scraper.py
This time you'll see the names of the sets appear in the output:
...
[scrapy] DEBUG: Scraped from <200 http://brickset.com/sets/year-2016>
{'name': 'Brick Bank'}
[scrapy] DEBUG: Scraped from <200 http://brickset.com/sets/year-2016>
{'name': 'Volkswagen Beetle'}
[scrapy] DEBUG: Scraped from <200 http://brickset.com/sets/year-2016>
{'name': 'Big Ben'}
[scrapy] DEBUG: Scraped from <200 http://brickset.com/sets/year-2016>
{'name': 'Winter Holiday Train'}
...
Let's keep expanding on this by adding new selectors for images, pieces, and miniature figures, or minifigs, that come with a set.
Take another look at the HTML for a specific set:
<article class="set">
<a class="highslide plain mainimg" href="http://images.brickset.com/sets/images/10251-1.jpg?201510121127" onclick="return hs.expand(this)">
<img src="http://images.brickset.com/sets/small/10251-1.jpg?201510121127" title="10251-1: Brick Bank"></a>
...
<div class="meta">
<h1><a href="/sets/10251-1/Brick-Bank"><span>10251:</span> Brick Bank</a> </h1>
...
<div class="col">
<dl>
<dt>Pieces</dt>
<dd><a class="plain" href="/inventories/10251-1">2380</a></dd>
<dt>Minifigs</dt>
<dd><a class="plain" href="/minifigs/inset-10251-1">5</a></dd>
...
</dl>
</div>
...
</div>
</article>
We can see a few things by examining this code:
- The image for the set is stored in the src attribute of an img tag inside an a tag at the start of the set. We can use another CSS selector to fetch this value just like we did when we grabbed the name of each set.
- Getting the number of pieces is a little trickier. There's a dt tag that contains the text Pieces, and then a dd tag that follows it which contains the actual number of pieces. We'll use XPath, a query language for traversing XML, to grab this, because it's too complex to be represented using CSS selectors. (Both XPath expressions are tried out interactively in the sketch after this list.)
- Getting the number of minifigs in a set is similar to getting the number of pieces. There's a dt tag that contains the text Minifigs, followed by a dd tag right after that with the number.
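Before wiring these selectors into the spider, it can help to try them in the interactive shell from earlier. The snippet below is a sketch; the example values come from the Brick Bank set, and other sets may return None for fields they don't list:

first_set = response.css('.set')[0]
first_set.xpath('.//dl[dt/text() = "Pieces"]/dd/a/text()').extract_first()       # e.g. '2380'
first_set.xpath('.//dl[dt/text() = "Minifigs"]/dd[2]/a/text()').extract_first()  # e.g. '5'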
So, let's modify the scraper to get this new information:
class BrickSetSpider(scrapy.Spider):
    name = 'brick_spider'
    start_urls = ['http://brickset.com/sets/year-2016']

    def parse(self, response):
        SET_SELECTOR = '.set'
        for brickset in response.css(SET_SELECTOR):
            NAME_SELECTOR = 'h1 a ::text'
            PIECES_SELECTOR = './/dl[dt/text() = "Pieces"]/dd/a/text()'
            MINIFIGS_SELECTOR = './/dl[dt/text() = "Minifigs"]/dd[2]/a/text()'
            IMAGE_SELECTOR = 'img ::attr(src)'
            yield {
                'name': brickset.css(NAME_SELECTOR).extract_first(),
                'pieces': brickset.xpath(PIECES_SELECTOR).extract_first(),
                'minifigs': brickset.xpath(MINIFIGS_SELECTOR).extract_first(),
                'image': brickset.css(IMAGE_SELECTOR).extract_first(),
            }
Save your changes and run the scraper again:
- scrapy runspider scraper.py
Now you’ll see that new data in the program's output:
2016-09-22 23:52:37 [scrapy] DEBUG: Scraped from <200 http://brickset.com/sets/year-2016>
{'minifigs': '5', 'pieces': '2380', 'name': 'Brick Bank', 'image': 'http://images.brickset.com/sets/small/10251-1.jpg?201510121127'}
2016-09-22 23:52:37 [scrapy] DEBUG: Scraped from <200 http://brickset.com/sets/year-2016>
{'minifigs': None, 'pieces': '1167', 'name': 'Volkswagen Beetle', 'image': 'http://images.brickset.com/sets/small/10252-1.jpg?201606140214'}
2016-09-22 23:52:37 [scrapy] DEBUG: Scraped from <200 http://brickset.com/sets/year-2016>
{'minifigs': None, 'pieces': '4163', 'name': 'Big Ben', 'image': 'http://images.brickset.com/sets/small/10253-1.jpg?201605190256'}
2016-09-22 23:52:37 [scrapy] DEBUG: Scraped from <200 http://brickset.com/sets/year-2016>
{'minifigs': None, 'pieces': None, 'name': 'Winter Holiday Train', 'image': 'http://images.brickset.com/sets/small/10254-1.jpg?201608110306'}
2016-09-22 23:52:37 [scrapy] DEBUG: Scraped from <200 http://brickset.com/sets/year-2016>
{'minifigs': None, 'pieces': None, 'name': 'XL Creative Brick Box', 'image': '/assets/images/misc/blankbox.gif'}
2016-09-22 23:52:37 [scrapy] DEBUG: Scraped from <200 http://brickset.com/sets/year-2016>
{'minifigs': None, 'pieces': '583', 'name': 'Creative Building Set', 'image': 'http://images.brickset.com/sets/small/10702-1.jpg?201511230710'}
Now let's turn this scraper into a spider that follows links.
Step 3 — Crawling Multiple Pages
We’ve successfully extracted data from that initial page, but we’re not progressing past it to see the rest of the results. The whole point of a spider is to detect and traverse links to other pages and grab data from those pages too.
You’ll notice that the top and bottom of each page has a little right caret (>) that links to the next page of results. Here's the HTML for that:
<ul class="pagelength">
...
<li class="next">
<a href="http://brickset.com/sets/year-2017/page-2">›</a>
</li>
<li class="last">
<a href="http://brickset.com/sets/year-2016/page-32">»</a>
</li>
</ul>
As you can see, there's a li tag with the class of next, and inside that tag, there's an a tag with a link to the next page. All we have to do is tell the scraper to follow that link if it exists.
Modify your code as follows:
class BrickSetSpider(scrapy.Spider):
    name = 'brick_spider'
    start_urls = ['http://brickset.com/sets/year-2016']

    def parse(self, response):
        SET_SELECTOR = '.set'
        for brickset in response.css(SET_SELECTOR):
            NAME_SELECTOR = 'h1 a ::text'
            PIECES_SELECTOR = './/dl[dt/text() = "Pieces"]/dd/a/text()'
            MINIFIGS_SELECTOR = './/dl[dt/text() = "Minifigs"]/dd[2]/a/text()'
            IMAGE_SELECTOR = 'img ::attr(src)'
            yield {
                'name': brickset.css(NAME_SELECTOR).extract_first(),
                'pieces': brickset.xpath(PIECES_SELECTOR).extract_first(),
                'minifigs': brickset.xpath(MINIFIGS_SELECTOR).extract_first(),
                'image': brickset.css(IMAGE_SELECTOR).extract_first(),
            }

        NEXT_PAGE_SELECTOR = '.next a ::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )
First, we define a selector for the "next page" link, extract the first match, and check if it exists. The scrapy.Request is a value that we return saying “Hey, crawl this page”, and callback=self.parse says “once you’ve gotten the HTML from this page, pass it back to this method so we can parse it, extract the data, and find the next page.”
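One small detail worth noting is response.urljoin. The href we extract may be relative or absolute depending on the page, and urljoin resolves it against the URL of the response we're currently parsing, so the request we yield always carries a full URL. For example (illustrative URLs):

# While parsing http://brickset.com/sets/year-2016:
response.urljoin('/sets/year-2016/page-2')
# -> 'http://brickset.com/sets/year-2016/page-2'
response.urljoin('http://brickset.com/sets/year-2016/page-2')
# -> already absolute, so it comes back unchanged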
This means that once we go to the next page, we’ll look for a link to the next page there, and on that page we’ll look for a link to the next page, and so on, until we don't find a link for the next page. This is the key piece of web scraping: finding and following links. In this example, it’s very linear; one page has a link to the next page until we’ve hit the last page. But you could follow links to tags, or other search results, or any other URL you'd like.
Now, if you save your code and run the spider again you’ll see that it doesn't just stop once it iterates through the first page of sets. It keeps on going through all 779 matches on 23 pages! In the grand scheme of things it’s not a huge chunk of data, but now you know the process by which you automatically find new pages to scrape.
Here’s our completed code for this tutorial:
import scrapy


class BrickSetSpider(scrapy.Spider):
    name = 'brick_spider'
    start_urls = ['http://brickset.com/sets/year-2016']

    def parse(self, response):
        SET_SELECTOR = '.set'
        for brickset in response.css(SET_SELECTOR):
            NAME_SELECTOR = 'h1 a ::text'
            PIECES_SELECTOR = './/dl[dt/text() = "Pieces"]/dd/a/text()'
            MINIFIGS_SELECTOR = './/dl[dt/text() = "Minifigs"]/dd[2]/a/text()'
            IMAGE_SELECTOR = 'img ::attr(src)'
            yield {
                'name': brickset.css(NAME_SELECTOR).extract_first(),
                'pieces': brickset.xpath(PIECES_SELECTOR).extract_first(),
                'minifigs': brickset.xpath(MINIFIGS_SELECTOR).extract_first(),
                'image': brickset.css(IMAGE_SELECTOR).extract_first(),
            }

        NEXT_PAGE_SELECTOR = '.next a ::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )
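As written, the spider only prints each item to the console. If you'd rather keep the results, Scrapy's feed exports can write the scraped items to a file straight from the command line. For example, this saves every item to sets.json:
- scrapy runspider scraper.py -o sets.json
Swapping the extension to .csv or .xml changes the output format accordingly.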
Conclusion
In this tutorial you built a fully functional spider that extracts data from web pages in less than thirty lines of code. That's a great start, but there are a lot of fun things you can do with this spider. Here are some ways you could expand the code you've written. They'll give you some practice scraping data.
- Right now we’re only parsing results from 2016, as you might have guessed from the 2016 part of http://brickset.com/sets/year-2016 — how would you crawl results from other years? (One way to start is sketched after this list.)
- There's a retail price included on most sets. How do you extract the data from that cell? How would you get a raw number out of it? Hint: you'll find the data in a dt/dd pair, just like the number of pieces and minifigs.
- Most of the results have tags that specify semantic data about the sets or their context. How do we crawl these, given that there are multiple tags for a single set?
That should be enough to get you thinking and experimenting. If you need more information on Scrapy, check out Scrapy’s official docs. For more information on working with data from the web, see our tutorial on "How To Scrape Web Pages with Beautiful Soup and Python 3".