Querying with BeautifulSoup and Selenium
Selenium
There are various strategies to locate elements in a page. You can use the most appropriate one for your case. Selenium provides the following methods to locate elements in a page:
- find_element_by_id
- find_element_by_name
- find_element_by_xpath
- find_element_by_link_text
- find_element_by_partial_link_text
- find_element_by_tag_name
- find_element_by_class_name
- find_element_by_css_selector
To find multiple elements (these methods will return a list):
- find_elements_by_name
- find_elements_by_xpath
- find_elements_by_link_text
- find_elements_by_partial_link_text
- find_elements_by_tag_name
- find_elements_by_class_name
- find_elements_by_css_selector
Apart from the public methods given above, there are two private methods which might be useful with locators in page objects: find_element and find_elements.
Example usage:
from selenium.webdriver.common.by import By
driver.find_element(By.XPATH, '//button[text()="Some text"]')
driver.find_elements(By.XPATH, '//button')
These are the attributes available for the By class:
ID = "id"
XPATH = "xpath"
LINK_TEXT = "link text"
PARTIAL_LINK_TEXT = "partial link text"
NAME = "name"
TAG_NAME = "tag name"
CLASS_NAME = "class name"
CSS_SELECTOR = "css selector"
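Each By attribute above maps one-to-one onto a locator strategy. A minimal sketch (assuming an already-created driver and a page that actually contains such elements; the locator values are placeholders taken from the examples below):
from selenium.webdriver.common.by import By
# each call mirrors one of the find_element_by_* shortcuts listed earlier
driver.find_element(By.ID, 'loginForm')
driver.find_element(By.NAME, 'username')
driver.find_element(By.XPATH, "//form[@id='loginForm']")
driver.find_element(By.LINK_TEXT, 'Continue')
driver.find_element(By.PARTIAL_LINK_TEXT, 'Conti')
driver.find_element(By.TAG_NAME, 'h1')
driver.find_element(By.CLASS_NAME, 'content')
driver.find_element(By.CSS_SELECTOR, 'p.content')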
Locating by Id
Use this when you know the id attribute of an element. With this strategy, the first element with an id attribute value matching the location will be returned. If no element has a matching id attribute, a NoSuchElementException will be raised.
For instance, consider this page source:
<html>
<body>
<form id="loginForm">
<input name="username" type="text" />
<input name="password" type="password" />
<input name="continue" type="submit" value="Login" />
</form>
</body>
</html>
The form element can be located like this:
login_form = driver.find_element_by_id('loginForm')
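If the id is absent, the lookup raises an exception rather than returning None. A small sketch of guarding against that (assuming the same driver and page source):
from selenium.common.exceptions import NoSuchElementException
try:
    login_form = driver.find_element_by_id('loginForm')
except NoSuchElementException:
    # only reached if the page has no element with id="loginForm"
    login_form = None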
Locating by Name
Use this when you know the name attribute of an element. With this strategy, the first element with a name attribute value matching the location will be returned. If no element has a matching name attribute, a NoSuchElementException will be raised.
For instance, consider this page source:
<html>
<body>
<form id="loginForm">
<input name="username" type="text" />
<input name="password" type="password" />
<input name="continue" type="submit" value="Login" />
<input name="continue" type="button" value="Clear" />
</form>
</body>
</html>
The username & password elements can be located like this:
username = driver.find_element_by_name('username')
password = driver.find_element_by_name('password')
This will give the “Login” button, as it occurs before the “Clear” button. Note that continue is a reserved word in Python, so the result is bound to a different variable name:
continue_button = driver.find_element_by_name('continue')
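To make the ordering explicit, a sketch using the plural method against the same page source returns both inputs named continue in document order:
continue_inputs = driver.find_elements_by_name('continue')
# continue_inputs[0] is the "Login" submit input, continue_inputs[1] is the "Clear" button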
Locating by XPath
XPath is the language used for locating nodes in an XML document. As HTML can be an implementation of XML (XHTML), Selenium users can leverage this powerful language to target elements in their web applications. XPath extends beyond (as well as supporting) the simple methods of locating by id or name attributes, and opens up all sorts of new possibilities such as locating the third checkbox on the page.
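As a hedged illustration of the “third checkbox” idea (assuming a hypothetical page that contains at least three checkbox inputs; the login form used below does not):
# hypothetical: requires a page with at least three <input type="checkbox"/> elements
third_checkbox = driver.find_element_by_xpath("(//input[@type='checkbox'])[3]")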
One of the main reasons for using XPath is when you don’t have a suitable id or name attribute for the element you wish to locate. You can use XPath to either locate the element in absolute terms (not advised), or relative to an element that does have an id or name attribute. XPath locators can also be used to specify elements via attributes other than id and name.
Absolute XPaths contain the location of all elements from the root (html) and as a result are likely to fail with only the slightest adjustment to the application. By finding a nearby element with an id or name attribute (ideally a parent element) you can locate your target element based on the relationship. This is much less likely to change and can make your tests more robust.
For instance, consider this page source:
<html>
<body>
<form id="loginForm">
<input name="username" type="text" />
<input name="password" type="password" />
<input name="continue" type="submit" value="Login" />
<input name="continue" type="button" value="Clear" />
</form>
</body>
</html>
The form elements can be located like this:
login_form = driver.find_element_by_xpath("/html/body/form[1]")
login_form = driver.find_element_by_xpath("//form[1]")
login_form = driver.find_element_by_xpath("//form[@id='loginForm']")
- Absolute path (would break if the HTML was changed only slightly)
- First form element in the HTML
- The form element with attribute named id and the value loginForm
The username element can be located like this:
username = driver.find_element_by_xpath("//form[input/@name='username']")
username = driver.find_element_by_xpath("//form[@id='loginForm']/input[1]")
username = driver.find_element_by_xpath("//input[@name='username']")
- First form element with an input child element with attribute named name and the value username
- First input child element of the form element with attribute named id and the value loginForm
- First input element with attribute named ‘name’ and the value username
The “Clear” button element can be located like this:
clear_button = driver.find_element_by_xpath("//input[@name='continue'][@type='button']")
clear_button = driver.find_element_by_xpath("//form[@id='loginForm']/input[4]")
- Input with attribute named name and the value continue and attribute named type and the value button
- Fourth input child element of the form element with attribute named id and value loginForm
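XPath also combines naturally with the plural method. A sketch that gathers every input inside the login form (same page source as above):
form_inputs = driver.find_elements_by_xpath("//form[@id='loginForm']/input")
# a list of four elements for the page source above: username, password, Login, Clear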
These examples cover some basics, but in order to learn more, the following references are recommended:
- W3Schools XPath Tutorial
- W3C XPath Recommendation
- XPath Tutorial - with interactive examples.
There are also a couple of very useful Add-ons that can assist in discovering the XPath of an element:
- XPath Checker - suggests XPath and can be used to test XPath results.
- Firebug - XPath suggestions are just one of the many powerful features of this very useful add-on.
- XPath Helper - for Google Chrome
Locating Hyperlinks by Link Text
Use this when you know link text used within an anchor tag. With this strategy, the first element with the link text value matching the location will be returned. If no element has a matching link text attribute, a NoSuchElementException will be raised.
For instance, consider this page source:
<html>
<body>
<p>Are you sure you want to do this?</p>
<a href="continue.html">Continue</a>
<a href="cancel.html">Cancel</a>
</body>
</html>
The continue.html link can be located like this:
continue_link = driver.find_element_by_link_text('Continue')
continue_link = driver.find_element_by_partial_link_text('Conti')
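Either locator yields an ordinary WebElement, so the link can be acted on directly. A brief sketch (assuming the same page source):
continue_link.click()  # follows the link to continue.html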
Locating Elements by Tag Name
Use this when you want to locate an element by tag name. With this strategy, the first element with the given tag name will be returned. If no element has a matching tag name, a NoSuchElementException will be raised.
For instance, consider this page source:
<html>
<body>
<h1>Welcome</h1>
<p>Site content goes here.</p>
</body>
</html>
The heading (h1) element can be located like this:
heading1 = driver.find_element_by_tag_name('h1')
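The plural variant returns every element with that tag name. A sketch against the same page source:
paragraphs = driver.find_elements_by_tag_name('p')
# a one-element list here: the <p>Site content goes here.</p> element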
Locating Elements by Class Name
Use this when you want to locate an element by class attribute name. With this strategy, the first element with the matching class attribute name will be returned. If no element has a matching class attribute name, a NoSuchElementException will be raised.
For instance, consider this page source:
<html>
<body>
<p class="content">Site content goes here.</p>
</body>
</html>
The “p” element can be located like this:
content = driver.find_element_by_class_name('content')
Locating Elements by CSS Selectors
Use this when you want to locate an element by CSS selector syntax. With this strategy, the first element matching the given CSS selector will be returned. If no element matches the CSS selector, a NoSuchElementException will be raised.
For instance, consider this page source:
<html>
<body>
<p class="content">Site content goes here.</p>
</body>
</html>
The “p” element can be located like this:
content = driver.find_element_by_css_selector('p.content')
BeautifulSoup
The name argument
Pass in a value for name and you’ll tell Beautiful Soup to only consider tags with certain names. Text strings will be ignored, as will tags whose names don’t match.
This is the simplest usage:
soup.find_all("title")
# [<title>The Dormouse's story</title>]
Recall from Kinds of filters that the value to name can be a string, a regular expression, a list, a function, or the value True.
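As a quick sketch of those other filter kinds (assuming the same “three sisters” soup used in the surrounding examples, plus import re):
import re
soup.find_all(re.compile("^b"))  # every tag whose name starts with "b", e.g. <body> and <b>
soup.find_all(["a", "b"])        # every <a> tag and every <b> tag
soup.find_all(True)              # every tag in the document, but none of the text strings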
The keyword arguments
Any argument that’s not recognized will be turned into a filter on one of a tag’s attributes. If you pass in a value for an argument called id, Beautiful Soup will filter against each tag’s ‘id’ attribute:
soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
If you pass in a value for href, Beautiful Soup will filter against each tag’s ‘href’ attribute:
soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
You can filter an attribute based on a string, a regular expression, a list, a function, or the value True.
This code finds all tags whose id attribute has a value, regardless of what the value is:
soup.find_all(id=True)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
You can filter multiple attributes at once by passing in more than one keyword argument:
soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">three</a>]
Some attributes, like the data-* attributes in HTML 5, have names that can’t be used as the names of keyword arguments:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression
You can use these attributes in searches by putting them into a dictionary and passing the dictionary into find_all() as the attrs argument:
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
Searching by CSS class
It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_:
soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
As with any keyword argument, you can pass class_ a string, a regular expression, a function, or True:
soup.find_all(class_=re.compile("itl"))
# [<p class="title"><b>The Dormouse's story</b></p>] def has_six_characters(css_class):
return css_class is not None and len(css_class) == 6 soup.find_all(class_=has_six_characters)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
Remember that a single tag can have multiple values for its “class” attribute. When you search for a tag that matches a certain CSS class, you’re matching against any of its CSS classes:
css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.find_all("p", class_="strikeout")
# [<p class="body strikeout"></p>] css_soup.find_all("p", class_="body")
# [<p class="body strikeout"></p>]
You can also search for the exact string value of the class attribute:
css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]
But searching for variants of the string value won’t work:
css_soup.find_all("p", class_="strikeout body")
# []
If you want to search for tags that match two or more CSS classes, you should use a CSS selector:
css_soup.select("p.strikeout.body")
# [<p class="body strikeout"></p>]
In older versions of Beautiful Soup, which don’t have the class_ shortcut, you can use the attrs trick mentioned above. Create a dictionary whose value for “class” is the string (or regular expression, or whatever) you want to search for:
soup.find_all("a", attrs={"class": "sister"})
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
The text argument
With text you can search for strings instead of tags. As with name and the keyword arguments, you can pass in a string, a regular expression, a list, a function, or the value True. Here are some examples:
soup.find_all(text="Elsie")
# [u'Elsie']
soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# [u'Elsie', u'Lacie', u'Tillie']
soup.find_all(text=re.compile("Dormouse"))
# [u"The Dormouse's story", u"The Dormouse's story"]
def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)
soup.find_all(text=is_the_only_string_within_a_tag)
# [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']
Although text is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for text. This code finds the <a> tags whose .string is “Elsie”:
soup.find_all("a", text="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]
The limit argument
find_all() returns all the tags and strings that match your filters. This can take a while if the document is large. If you don’t need all the results, you can pass in a number for limit. This works just like the LIMIT keyword in SQL. It tells Beautiful Soup to stop gathering results after it’s found a certain number.
There are three links in the “three sisters” document, but this code only finds the first two:
soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
The recursive argument
If you call mytag.find_all(), Beautiful Soup will examine all the descendants of mytag: its children, its children’s children, and so on. If you only want Beautiful Soup to consider direct children, you can pass in recursive=False. See the difference here:
soup.html.find_all("title")
# [<title>The Dormouse's story</title>]
soup.html.find_all("title", recursive=False)
# []
Here’s that part of the document:
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
...
The <title> tag is beneath the <html> tag, but it’s not directly beneath the <html> tag: the <head> tag is in the way. Beautiful Soup finds the <title> tag when it’s allowed to look at all descendants of the <html> tag, but when recursive=False restricts it to the <html> tag’s immediate children, it finds nothing.
Beautiful Soup offers a lot of tree-searching methods (covered below), and they mostly take the same arguments as find_all(): name, attrs, text, limit, and the keyword arguments. But the recursive argument is different: find_all() and find() are the only methods that support it. Passing recursive=False into a method like find_parents() wouldn’t be very useful.
Calling a tag is like calling find_all()
Because find_all() is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the BeautifulSoup object or a Tag object as though it were a function, then it’s the same as calling find_all() on that object. These two lines of code are equivalent:
soup.find_all("a")
soup("a")
These two lines are also equivalent:
soup.title.find_all(text=True)
soup.title(text=True)
find()
Signature: find(name, attrs, recursive, text, **kwargs)
The find_all() method scans the entire document looking for results, but sometimes you only want to find one result. If you know a document only has one <body> tag, it’s a waste of time to scan the entire document looking for more. Rather than passing in limit=1 every time you call find_all, you can use the find() method. These two lines of code are nearly equivalent:
soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]
soup.find('title')
# <title>The Dormouse's story</title>
The only difference is that find_all() returns a list containing the single result, and find() just returns the result.
If find_all() can’t find anything, it returns an empty list. If find() can’t find anything, it returns None:
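For example (a sketch assuming a tag name that does not occur anywhere in the document):
print(soup.find("nosuchtag"))
# None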