Querying with BeautifulSoup and Selenium
Selenium
There are various strategies to locate elements in a page. You can use the most appropriate one for your case. Selenium provides the following methods to locate elements in a page:
- find_element_by_id
- find_element_by_name
- find_element_by_xpath
- find_element_by_link_text
- find_element_by_partial_link_text
- find_element_by_tag_name
- find_element_by_class_name
- find_element_by_css_selector
To find multiple elements (these methods will return a list):
- find_elements_by_name
- find_elements_by_xpath
- find_elements_by_link_text
- find_elements_by_partial_link_text
- find_elements_by_tag_name
- find_elements_by_class_name
- find_elements_by_css_selector
Apart from the public methods given above, there are two private methods which might be useful with locators in page objects: find_element and find_elements.
Example usage:
from selenium.webdriver.common.by import By
driver.find_element(By.XPATH, '//button[text()="Some text"]')
driver.find_elements(By.XPATH, '//button')
These are the attributes available for the By class:
ID = "id"
XPATH = "xpath"
LINK_TEXT = "link text"
PARTIAL_LINK_TEXT = "partial link text"
NAME = "name"
TAG_NAME = "tag name"
CLASS_NAME = "class name"
CSS_SELECTOR = "css selector"
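Each By attribute above maps one-to-one onto a locator strategy. A minimal sketch (assuming an already-created driver and a page that actually contains such elements; the locator values are placeholders taken from the examples below):
from selenium.webdriver.common.by import By
# each call mirrors one of the find_element_by_* shortcuts listed earlier
driver.find_element(By.ID, 'loginForm')
driver.find_element(By.NAME, 'username')
driver.find_element(By.XPATH, "//form[@id='loginForm']")
driver.find_element(By.LINK_TEXT, 'Continue')
driver.find_element(By.PARTIAL_LINK_TEXT, 'Conti')
driver.find_element(By.TAG_NAME, 'h1')
driver.find_element(By.CLASS_NAME, 'content')
driver.find_element(By.CSS_SELECTOR, 'p.content')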
Locating by Id
Use this when you know the id attribute of an element. With this strategy, the first element with an id attribute value matching the location will be returned. If no element has a matching id attribute, a NoSuchElementException will be raised.
For instance, consider this page source:
<html>
<body>
<form id="loginForm">
<input name="username" type="text" />
<input name="password" type="password" />
<input name="continue" type="submit" value="Login" />
</form>
</body>
</html>
The form element can be located like this:
login_form = driver.find_element_by_id('loginForm')
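If the id is absent, the lookup raises an exception rather than returning None. A small sketch of guarding against that (assuming the same driver and page source):
from selenium.common.exceptions import NoSuchElementException
try:
    login_form = driver.find_element_by_id('loginForm')
except NoSuchElementException:
    # only reached if the page has no element with id="loginForm"
    login_form = None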
Locating by Name
Use this when you know the name attribute of an element. With this strategy, the first element with a name attribute value matching the location will be returned. If no element has a matching name attribute, a NoSuchElementException will be raised.
For instance, consider this page source:
<html>
<body>
<form id="loginForm">
<input name="username" type="text" />
<input name="password" type="password" />
<input name="continue" type="submit" value="Login" />
<input name="continue" type="button" value="Clear" />
</form>
</body>
</html>
The username & password elements can be located like this:
username = driver.find_element_by_name('username')
password = driver.find_element_by_name('password')
This will give the “Login” button, as it occurs before the “Clear” button. Note that continue is a reserved word in Python, so the result is bound to a different variable name:
continue_button = driver.find_element_by_name('continue')
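To make the ordering explicit, a sketch using the plural method against the same page source returns both inputs named continue in document order:
continue_inputs = driver.find_elements_by_name('continue')
# continue_inputs[0] is the "Login" submit input, continue_inputs[1] is the "Clear" button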
Locating by XPath
XPath is the language used for locating nodes in an XML document. As HTML can be an implementation of XML (XHTML), Selenium users can leverage this powerful language to target elements in their web applications. XPath extends beyond (as well as supporting) the simple methods of locating by id or name attributes, and opens up all sorts of new possibilities such as locating the third checkbox on the page.
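As a hedged illustration of the “third checkbox” idea (assuming a hypothetical page that contains at least three checkbox inputs; the login form used below does not):
# hypothetical: requires a page with at least three <input type="checkbox"/> elements
third_checkbox = driver.find_element_by_xpath("(//input[@type='checkbox'])[3]")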
One of the main reasons for using XPath is when you don’t have a suitable id or name attribute for the element you wish to locate. You can use XPath to either locate the element in absolute terms (not advised), or relative to an element that does have an id or name attribute. XPath locators can also be used to specify elements via attributes other than id and name.
Absolute XPaths contain the location of all elements from the root (html) and as a result are likely to fail with only the slightest adjustment to the application. By finding a nearby element with an id or name attribute (ideally a parent element) you can locate your target element based on the relationship. This is much less likely to change and can make your tests more robust.
For instance, consider this page source:
<html>
<body>
<form id="loginForm">
<input name="username" type="text" />
<input name="password" type="password" />
<input name="continue" type="submit" value="Login" />
<input name="continue" type="button" value="Clear" />
</form>
</body>
</html>
The form elements can be located like this:
login_form = driver.find_element_by_xpath("/html/body/form[1]")
login_form = driver.find_element_by_xpath("//form[1]")
login_form = driver.find_element_by_xpath("//form[@id='loginForm']")
- Absolute path (would break if the HTML was changed only slightly)
- First form element in the HTML
- The form element with attribute named id and the value loginForm
The username element can be located like this:
username = driver.find_element_by_xpath("//form[input/@name='username']")
username = driver.find_element_by_xpath("//form[@id='loginForm']/input[1]")
username = driver.find_element_by_xpath("//input[@name='username']")
- First form element with an input child element with attribute named name and the value username
- First input child element of the form element with attribute named id and the value loginForm
- First input element with attribute named ‘name’ and the value username
The “Clear” button element can be located like this:
clear_button = driver.find_element_by_xpath("//input[@name='continue'][@type='button']")
clear_button = driver.find_element_by_xpath("//form[@id='loginForm']/input[4]")
- Input with attribute named name and the value continue and attribute named type and the value button
- Fourth input child element of the form element with attribute named id and value loginForm
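XPath also combines naturally with the plural method. A sketch that gathers every input inside the login form (same page source as above):
form_inputs = driver.find_elements_by_xpath("//form[@id='loginForm']/input")
# a list of four elements for the page source above: username, password, Login, Clear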
These examples cover some basics, but in order to learn more, the following references are recommended:
- W3Schools XPath Tutorial
- W3C XPath Recommendation
- XPath Tutorial - with interactive examples.
There are also a couple of very useful Add-ons that can assist in discovering the XPath of an element:
- XPath Checker - suggests XPath and can be used to test XPath results.
- Firebug - XPath suggestions are just one of the many powerful features of this very useful add-on.
- XPath Helper - for Google Chrome
Locating Hyperlinks by Link Text
Use this when you know link text used within an anchor tag. With this strategy, the first element with the link text value matching the location will be returned. If no element has a matching link text attribute, a NoSuchElementException will be raised.
For instance, consider this page source:
<html>
<body>
<p>Are you sure you want to do this?</p>
<a href="continue.html">Continue</a>
<a href="cancel.html">Cancel</a>
</body>
</html>
The continue.html link can be located like this:
continue_link = driver.find_element_by_link_text('Continue')
continue_link = driver.find_element_by_partial_link_text('Conti')
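Either locator yields an ordinary WebElement, so the link can be acted on directly. A brief sketch (assuming the same page source):
continue_link.click()  # follows the link to continue.html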
Locating Elements by Tag Name
Use this when you want to locate an element by tag name. With this strategy, the first element with the given tag name will be returned. If no element has a matching tag name, a NoSuchElementException will be raised.
For instance, consider this page source:
<html>
<body>
<h1>Welcome</h1>
<p>Site content goes here.</p>
</body>
</html>
The heading (h1) element can be located like this:
heading1 = driver.find_element_by_tag_name('h1')
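The plural variant returns every element with that tag name. A sketch against the same page source:
paragraphs = driver.find_elements_by_tag_name('p')
# a one-element list here: the <p>Site content goes here.</p> element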
Locating Elements by Class Name
Use this when you want to locate an element by class attribute name. With this strategy, the first element with the matching class attribute name will be returned. If no element has a matching class attribute name, a NoSuchElementException will be raised.
For instance, consider this page source:
<html>
<body>
<p class="content">Site content goes here.</p>
</body>
</html>
The “p” element can be located like this:
content = driver.find_element_by_class_name('content')
Locating Elements by CSS Selectors
Use this when you want to locate an element by CSS selector syntax. With this strategy, the first element matching the given CSS selector will be returned. If no element matches the CSS selector, a NoSuchElementException will be raised.
For instance, consider this page source:
<html>
<body>
<p class="content">Site content goes here.</p>
</body>
</html>
The “p” element can be located like this:
content = driver.find_element_by_css_selector('p.content')
BeautifulSoup
The name argument
Pass in a value for name and you’ll tell Beautiful Soup to only consider tags with certain names. Text strings will be ignored, as will tags whose names don’t match.
This is the simplest usage:
soup.find_all("title")
# [<title>The Dormouse's story</title>]
Recall from Kinds of filters that the value to name can be a string, a regular expression, a list, a function, or the value True.
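As a quick sketch of those other filter kinds (assuming the same “three sisters” soup used in the surrounding examples, plus import re):
import re
soup.find_all(re.compile("^b"))  # every tag whose name starts with "b", e.g. <body> and <b>
soup.find_all(["a", "b"])        # every <a> tag and every <b> tag
soup.find_all(True)              # every tag in the document, but none of the text strings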
The keyword arguments
Any argument that’s not recognized will be turned into a filter on one of a tag’s attributes. If you pass in a value for an argument called id, Beautiful Soup will filter against each tag’s ‘id’ attribute:
soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
If you pass in a value for href, Beautiful Soup will filter against each tag’s ‘href’ attribute:
soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
You can filter an attribute based on a string, a regular expression, a list, a function, or the value True.
This code finds all tags whose id attribute has a value, regardless of what the value is:
soup.find_all(id=True)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
You can filter multiple attributes at once by passing in more than one keyword argument:
soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">three</a>]
Some attributes, like the data-* attributes in HTML 5, have names that can’t be used as the names of keyword arguments:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression
You can use these attributes in searches by putting them into a dictionary and passing the dictionary into find_all() as the attrs argument:
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
Searching by CSS class
It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_:
soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
As with any keyword argument, you can pass class_ a string, a regular expression, a function, or True:
soup.find_all(class_=re.compile("itl"))
# [<p class="title"><b>The Dormouse's story</b></p>] def has_six_characters(css_class):
return css_class is not None and len(css_class) == 6 soup.find_all(class_=has_six_characters)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
Remember that a single tag can have multiple values for its “class” attribute. When you search for a tag that matches a certain CSS class, you’re matching against any of its CSS classes:
css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.find_all("p", class_="strikeout")
# [<p class="body strikeout"></p>] css_soup.find_all("p", class_="body")
# [<p class="body strikeout"></p>]
You can also search for the exact string value of the class attribute:
css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]
But searching for variants of the string value won’t work:
css_soup.find_all("p", class_="strikeout body")
# []
If you want to search for tags that match two or more CSS classes, you should use a CSS selector:
css_soup.select("p.strikeout.body")
# [<p class="body strikeout"></p>]
In older versions of Beautiful Soup, which don’t have the class_ shortcut, you can use the attrs trick mentioned above. Create a dictionary whose value for “class” is the string (or regular expression, or whatever) you want to search for:
soup.find_all("a", attrs={"class": "sister"})
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
The text argument
With text you can search for strings instead of tags. As with name and the keyword arguments, you can pass in a string, a regular expression, a list, a function, or the value True. Here are some examples:
soup.find_all(text="Elsie")
# [u'Elsie']
soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# [u'Elsie', u'Lacie', u'Tillie']
soup.find_all(text=re.compile("Dormouse"))
# [u"The Dormouse's story", u"The Dormouse's story"]
def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)
soup.find_all(text=is_the_only_string_within_a_tag)
# [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']
Although text is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for text. This code finds the <a> tags whose .string is “Elsie”:
soup.find_all("a", text="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]
The limit argument
find_all() returns all the tags and strings that match your filters. This can take a while if the document is large. If you don’t need all the results, you can pass in a number for limit. This works just like the LIMIT keyword in SQL. It tells Beautiful Soup to stop gathering results after it’s found a certain number.
There are three links in the “three sisters” document, but this code only finds the first two:
soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
The recursive argument
If you call mytag.find_all(), Beautiful Soup will examine all the descendants of mytag: its children, its children’s children, and so on. If you only want Beautiful Soup to consider direct children, you can pass in recursive=False. See the difference here:
soup.html.find_all("title")
# [<title>The Dormouse's story</title>]
soup.html.find_all("title", recursive=False)
# []
Here’s that part of the document:
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
...
The <title> tag is beneath the <html> tag, but it’s not directly beneath the <html> tag: the <head> tag is in the way. Beautiful Soup finds the <title> tag when it’s allowed to look at all descendants of the <html> tag, but when recursive=False restricts it to the <html> tag’s immediate children, it finds nothing.
Beautiful Soup offers a lot of tree-searching methods (covered below), and they mostly take the same arguments as find_all(): name, attrs, text, limit, and the keyword arguments. But the recursive argument is different: find_all() and find() are the only methods that support it. Passing recursive=False into a method like find_parents() wouldn’t be very useful.
Calling a tag is like calling find_all()
Because find_all() is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the BeautifulSoup object or a Tag object as though it were a function, then it’s the same as calling find_all() on that object. These two lines of code are equivalent:
soup.find_all("a")
soup("a")
These two lines are also equivalent:
soup.title.find_all(text=True)
soup.title(text=True)
find()
Signature: find(name, attrs, recursive, text, **kwargs)
The find_all() method scans the entire document looking for results, but sometimes you only want to find one result. If you know a document only has one <body> tag, it’s a waste of time to scan the entire document looking for more. Rather than passing in limit=1 every time you call find_all, you can use the find() method. These two lines of code are nearly equivalent:
soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]
soup.find('title')
# <title>The Dormouse's story</title>
The only difference is that find_all() returns a list containing the single result, and find() just returns the result.
If find_all() can’t find anything, it returns an empty list. If find() can’t find anything, it returns None:
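For example (a sketch assuming a tag name that does not occur anywhere in the document):
print(soup.find("nosuchtag"))
# None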