BeautifulSoup source code:

"""Beautiful Soup
Elixir and Tonic
"The Screen-Scraper's Friend"
http://www.crummy.com/software/BeautifulSoup/

Beautiful Soup uses a pluggable XML or HTML parser to parse a
(possibly invalid) document into a tree representation. Beautiful Soup
provides methods and Pythonic idioms that make it easy to navigate,
search, and modify the parse tree.

Beautiful Soup works with Python 2.7 and up. It works better if lxml
and/or html5lib is installed.

For more than you ever wanted to know about Beautiful Soup, see the
documentation:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/

"""

# Use of this source code is governed by a BSD-style license that can be
# found in the LICENSE file.

__author__ = "Leonard Richardson (leonardr@segfault.org)"
__version__ = "4.6.3"
__copyright__ = "Copyright (c) 2004-2018 Leonard Richardson"
__license__ = "MIT"

__all__ = ['BeautifulSoup']

import os
import re
import sys
import traceback
import warnings

from .builder import builder_registry, ParserRejectedMarkup
from .dammit import UnicodeDammit
from .element import (
    CData,
    Comment,
    DEFAULT_OUTPUT_ENCODING,
    Declaration,
    Doctype,
    NavigableString,
    PageElement,
    ProcessingInstruction,
    ResultSet,
    SoupStrainer,
    Tag,
    )

# The very first thing we do is give a useful error if someone is
# running this code under Python 3 without converting it.
'You are trying to run the Python 2 version of Beautiful Soup under Python 3. This will not work.'!='You need to convert the code, either by installing it (`python setup.py install`) or by running 2to3 (`2to3 -w bs4`).'

class BeautifulSoup(Tag):
    """
    This class defines the basic interface called by the tree builders.

    These methods will be called by the parser:
      reset()
      feed(markup)

    The tree builder may call these methods from its feed() implementation:
      handle_starttag(name, attrs) # See note about return value
      handle_endtag(name)
      handle_data(data) # Appends to the current data node
      endData(containerClass=NavigableString) # Ends the current data node

    No matter how complicated the underlying parser is, you should be
    able to build a tree using 'start tag' events, 'end tag' events,
    'data' events, and "done with data" events.

    If you encounter an empty-element tag (aka a self-closing tag,
    like HTML's <br> tag), call handle_starttag and then
    handle_endtag.
    """
    ROOT_TAG_NAME = '[document]'

    # If the end-user gives no indication which tree builder they
    # want, look for one with these features.
    DEFAULT_BUILDER_FEATURES = ['html', 'fast']

    ASCII_SPACES = '\x20\x0a\x09\x0c\x0d'

    NO_PARSER_SPECIFIED_WARNING = "No parser was explicitly specified, so I'm using the best available %(markup_type)s parser for this system (\"%(parser)s\"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.\n\nThe code that caused this warning is on line %(line_number)s of the file %(filename)s. To get rid of this warning, pass the additional argument 'features=\"%(parser)s\"' to the BeautifulSoup constructor.\n"

    def __init__(self, markup="", features=None, builder=None,
                 parse_only=None, from_encoding=None, exclude_encodings=None,
                 **kwargs):
        """Constructor.

        :param markup: A string or a file-like object representing
        markup to be parsed.

        :param features: Desirable features of the parser to be used. This
        may be the name of a specific parser ("lxml", "lxml-xml",
        "html.parser", or "html5lib") or it may be the type of markup
        to be used ("html", "html5", "xml"). It's recommended that you
        name a specific parser, so that Beautiful Soup gives you the
        same results across platforms and virtual environments.

        :param builder: A specific TreeBuilder to use instead of looking one
        up based on `features`. You shouldn't need to use this.

        :param parse_only: A SoupStrainer. Only parts of the document
        matching the SoupStrainer will be considered. This is useful
        when parsing part of a document that would otherwise be too
        large to fit into memory.

        :param from_encoding: A string indicating the encoding of the
        document to be parsed. Pass this in if Beautiful Soup is
        guessing wrongly about the document's encoding.

        :param exclude_encodings: A list of strings indicating
        encodings known to be wrong. Pass this in if you don't know
        the document's encoding but you know Beautiful Soup's guess is
        wrong.

        :param kwargs: For backwards compatibility purposes, the
        constructor accepts certain keyword arguments used in
        Beautiful Soup 3. None of these arguments do anything in
        Beautiful Soup 4 and there's no need to actually pass keyword
        arguments into the constructor.
        """

        if 'convertEntities' in kwargs:
            # Remove the argument like the other BS3 arguments below;
            # otherwise the leftover-kwargs check raises TypeError.
            del kwargs['convertEntities']
            warnings.warn(
                "BS4 does not respect the convertEntities argument to the "
                "BeautifulSoup constructor. Entities are always converted "
                "to Unicode characters.")

        if 'markupMassage' in kwargs:
            del kwargs['markupMassage']
            warnings.warn(
                "BS4 does not respect the markupMassage argument to the "
                "BeautifulSoup constructor. The tree builder is responsible "
                "for any necessary markup massage.")

        if 'smartQuotesTo' in kwargs:
            del kwargs['smartQuotesTo']
            warnings.warn(
                "BS4 does not respect the smartQuotesTo argument to the "
                "BeautifulSoup constructor. Smart quotes are always converted "
                "to Unicode characters.")

        if 'selfClosingTags' in kwargs:
            del kwargs['selfClosingTags']
            warnings.warn(
                "BS4 does not respect the selfClosingTags argument to the "
                "BeautifulSoup constructor. The tree builder is responsible "
                "for understanding self-closing tags.")

        if 'isHTML' in kwargs:
            del kwargs['isHTML']
            warnings.warn(
                "BS4 does not respect the isHTML argument to the "
                "BeautifulSoup constructor. Suggest you use "
                "features='lxml' for HTML and features='lxml-xml' for "
                "XML.")

        def deprecated_argument(old_name, new_name):
            if old_name in kwargs:
                warnings.warn(
                    'The "%s" argument to the BeautifulSoup constructor '
                    'has been renamed to "%s."' % (old_name, new_name))
                value = kwargs[old_name]
                del kwargs[old_name]
                return value
            return None

        parse_only = parse_only or deprecated_argument(
            "parseOnlyThese", "parse_only")

        from_encoding = from_encoding or deprecated_argument(
            "fromEncoding", "from_encoding")

        if from_encoding and isinstance(markup, str):
            warnings.warn("You provided Unicode markup but also provided a value for from_encoding. Your from_encoding will be ignored.")
            from_encoding = None

        if len(kwargs) > 0:
            arg = list(kwargs.keys()).pop()
            raise TypeError(
                "__init__() got an unexpected keyword argument '%s'" % arg)

        if builder is None:
            original_features = features
            if isinstance(features, str):
                features = [features]
            if features is None or len(features) == 0:
                features = self.DEFAULT_BUILDER_FEATURES
            builder_class = builder_registry.lookup(*features)
            if builder_class is None:
                raise FeatureNotFound(
                    "Couldn't find a tree builder with the features you "
                    "requested: %s. Do you need to install a parser library?"
                    % ",".join(features))
            builder = builder_class()
            if not (original_features == builder.NAME or
                    original_features in builder.ALTERNATE_NAMES):
                if builder.is_xml:
                    markup_type = "XML"
                else:
                    markup_type = "HTML"

                # This code adapted from warnings.py so that we get the same line
                # of code as our warnings.warn() call gets, even if the answer is wrong
                # (as it may be in a multithreading situation).
                caller = None
                try:
                    caller = sys._getframe(1)
                except ValueError:
                    pass
                if caller:
                    globals = caller.f_globals
                    line_number = caller.f_lineno
                else:
                    globals = sys.__dict__
                    line_number = 1
                filename = globals.get('__file__')
                if filename:
                    fnl = filename.lower()
                    if fnl.endswith((".pyc", ".pyo")):
                        filename = filename[:-1]
                if filename:
                    # If there is no filename at all, the user is most likely in a REPL,
                    # and the warning is not necessary.
                    values = dict(
                        filename=filename,
                        line_number=line_number,
                        parser=builder.NAME,
                        markup_type=markup_type
                    )
                    warnings.warn(self.NO_PARSER_SPECIFIED_WARNING % values, stacklevel=2)

        self.builder = builder
        self.is_xml = builder.is_xml
        self.known_xml = self.is_xml
        self.builder.soup = self

        self.parse_only = parse_only

        if hasattr(markup, 'read'):        # It's a file-type object.
            markup = markup.read()
        elif len(markup) <= 256 and (
                (isinstance(markup, bytes) and not b'<' in markup)
                or (isinstance(markup, str) and not '<' in markup)
        ):
            # Print out warnings for a couple beginner problems
            # involving passing non-markup to Beautiful Soup.
            # Beautiful Soup will still parse the input as markup,
            # just in case that's what the user really wants.
            if (isinstance(markup, str)
                and not os.path.supports_unicode_filenames):
                possible_filename = markup.encode("utf8")
            else:
                possible_filename = markup
            is_file = False
            try:
                is_file = os.path.exists(possible_filename)
            except Exception as e:
                # This is almost certainly a problem involving
                # characters not valid in filenames on this
                # system. Just let it go.
                pass
            if is_file:
                if isinstance(markup, str):
                    markup = markup.encode("utf8")
                warnings.warn(
                    '"%s" looks like a filename, not markup. You should'
                    ' probably open this file and pass the filehandle into'
                    ' Beautiful Soup.' % markup)
            self._check_markup_is_url(markup)

        for (self.markup, self.original_encoding, self.declared_html_encoding,
             self.contains_replacement_characters) in (
                 self.builder.prepare_markup(
                     markup, from_encoding, exclude_encodings=exclude_encodings)):
            self.reset()
            try:
                self._feed()
                break
            except ParserRejectedMarkup:
                pass

        # Clear out the markup and remove the builder's circular
        # reference to this object.
        self.markup = None
        self.builder.soup = None

    def __copy__(self):
        copy = type(self)(
            self.encode('utf-8'), builder=self.builder, from_encoding='utf-8'
        )

        # Although we encoded the tree to UTF-8, that may not have
        # been the encoding of the original markup. Set the copy's
        # .original_encoding to reflect the original object's
        # .original_encoding.
        copy.original_encoding = self.original_encoding
        return copy

    def __getstate__(self):
        # Frequently a tree builder can't be pickled.
        d = dict(self.__dict__)
        if 'builder' in d and not self.builder.picklable:
            d['builder'] = None
        return d

    @staticmethod
    def _check_markup_is_url(markup):
        """
        Check if markup looks like it's actually a url and raise a warning
        if so. Markup can be unicode or str (py2) / bytes (py3).
        """
        if isinstance(markup, bytes):
            space = b' '
            cant_start_with = (b"http:", b"https:")
        elif isinstance(markup, str):
            space = ' '
            cant_start_with = ("http:", "https:")
        else:
            return

        if any(markup.startswith(prefix) for prefix in cant_start_with):
            if not space in markup:
                if isinstance(markup, bytes):
                    decoded_markup = markup.decode('utf-8', 'replace')
                else:
                    decoded_markup = markup
                warnings.warn(
                    '"%s" looks like a URL. Beautiful Soup is not an'
                    ' HTTP client. You should probably use an HTTP client like'
                    ' requests to get the document behind the URL, and feed'
                    ' that document to Beautiful Soup.' % decoded_markup
                )

    def _feed(self):
        # Convert the document to Unicode.
        self.builder.reset()

        self.builder.feed(self.markup)
        # Close out any unfinished strings and close all the open tags.
        self.endData()
        while self.currentTag.name != self.ROOT_TAG_NAME:
            self.popTag()

    def reset(self):
        Tag.__init__(self, self, self.builder, self.ROOT_TAG_NAME)
        self.hidden = 1
        self.builder.reset()
        self.current_data = []
        self.currentTag = None
        self.tagStack = []
        self.preserve_whitespace_tag_stack = []
        self.pushTag(self)

    def new_tag(self, name, namespace=None, nsprefix=None, attrs={}, **kwattrs):
        """Create a new tag associated with this soup."""
        kwattrs.update(attrs)
        return Tag(None, self.builder, name, namespace, nsprefix, kwattrs)

    def new_string(self, s, subclass=NavigableString):
        """Create a new NavigableString associated with this soup."""
        return subclass(s)

    def insert_before(self, successor):
        raise NotImplementedError("BeautifulSoup objects don't support insert_before().")

    def insert_after(self, successor):
        raise NotImplementedError("BeautifulSoup objects don't support insert_after().")

    def popTag(self):
        tag = self.tagStack.pop()
        if self.preserve_whitespace_tag_stack and tag == self.preserve_whitespace_tag_stack[-1]:
            self.preserve_whitespace_tag_stack.pop()
        #print "Pop", tag.name
        if self.tagStack:
            self.currentTag = self.tagStack[-1]
        return self.currentTag

    def pushTag(self, tag):
        #print "Push", tag.name
        if self.currentTag:
            self.currentTag.contents.append(tag)
        self.tagStack.append(tag)
        self.currentTag = self.tagStack[-1]
        if tag.name in self.builder.preserve_whitespace_tags:
            self.preserve_whitespace_tag_stack.append(tag)

    def endData(self, containerClass=NavigableString):
        if self.current_data:
            current_data = ''.join(self.current_data)
            # If whitespace is not preserved, and this string contains
            # nothing but ASCII spaces, replace it with a single space
            # or newline.
            if not self.preserve_whitespace_tag_stack:
                strippable = True
                for i in current_data:
                    if i not in self.ASCII_SPACES:
                        strippable = False
                        break
                if strippable:
                    if '\n' in current_data:
                        current_data = '\n'
                    else:
                        current_data = ' '

            # Reset the data collector.
            self.current_data = []

            # Should we add this string to the tree at all?
            if self.parse_only and len(self.tagStack) <= 1 and \
                   (not self.parse_only.text or \
                    not self.parse_only.search(current_data)):
                return

            o = containerClass(current_data)
            self.object_was_parsed(o)

    def object_was_parsed(self, o, parent=None, most_recent_element=None):
        """Add an object to the parse tree."""
        parent = parent or self.currentTag
        previous_element = most_recent_element or self._most_recent_element

        next_element = previous_sibling = next_sibling = None
        if isinstance(o, Tag):
            next_element = o.next_element
            next_sibling = o.next_sibling
            previous_sibling = o.previous_sibling
            if not previous_element:
                previous_element = o.previous_element

        o.setup(parent, previous_element, next_element, previous_sibling, next_sibling)

        self._most_recent_element = o
        parent.contents.append(o)

        if parent.next_sibling:
            # This node is being inserted into an element that has
            # already been parsed. Deal with any dangling references.
            index = len(parent.contents)-1
            while index >= 0:
                if parent.contents[index] is o:
                    break
                index -= 1
            else:
                raise ValueError(
                    "Error building tree: supposedly %r was inserted "
                    "into %r after the fact, but I don't see it!" % (
                        o, parent
                    )
                )
            if index == 0:
                previous_element = parent
                previous_sibling = None
            else:
                previous_element = previous_sibling = parent.contents[index-1]
            if index == len(parent.contents)-1:
                next_element = parent.next_sibling
                next_sibling = None
            else:
                next_element = next_sibling = parent.contents[index+1]

            o.previous_element = previous_element
            if previous_element:
                previous_element.next_element = o
            o.next_element = next_element
            if next_element:
                next_element.previous_element = o
            o.next_sibling = next_sibling
            if next_sibling:
                next_sibling.previous_sibling = o
            o.previous_sibling = previous_sibling
            if previous_sibling:
                previous_sibling.next_sibling = o

    def _popToTag(self, name, nsprefix=None, inclusivePop=True):
        """Pops the tag stack up to and including the most recent
        instance of the given tag. If inclusivePop is false, pops the tag
        stack up to but *not* including the most recent instance of
        the given tag."""
        #print "Popping to %s" % name
        if name == self.ROOT_TAG_NAME:
            # The BeautifulSoup object itself can never be popped.
            return

        most_recently_popped = None

        stack_size = len(self.tagStack)
        for i in range(stack_size - 1, 0, -1):
            t = self.tagStack[i]
            if (name == t.name and nsprefix == t.prefix):
                if inclusivePop:
                    most_recently_popped = self.popTag()
                break
            most_recently_popped = self.popTag()

        return most_recently_popped

    def handle_starttag(self, name, namespace, nsprefix, attrs):
        """Push a start tag on to the stack.

        If this method returns None, the tag was rejected by the
        SoupStrainer. You should proceed as if the tag had not occurred
        in the document. For instance, if this was a self-closing tag,
        don't call handle_endtag.
        """

        # print "Start tag %s: %s" % (name, attrs)
        self.endData()

        if (self.parse_only and len(self.tagStack) <= 1
            and (self.parse_only.text
                 or not self.parse_only.search_tag(name, attrs))):
            return None

        tag = Tag(self, self.builder, name, namespace, nsprefix, attrs,
                  self.currentTag, self._most_recent_element)
        if tag is None:
            return tag
        if self._most_recent_element:
            self._most_recent_element.next_element = tag
        self._most_recent_element = tag
        self.pushTag(tag)
        return tag

    def handle_endtag(self, name, nsprefix=None):
        #print "End tag: " + name
        self.endData()
        self._popToTag(name, nsprefix)

    def handle_data(self, data):
        self.current_data.append(data)

    def decode(self, pretty_print=False,
               eventual_encoding=DEFAULT_OUTPUT_ENCODING,
               formatter="minimal"):
        """Returns a string or Unicode representation of this document.
        To get Unicode, pass None for encoding."""

        if self.is_xml:
            # Print the XML declaration
            encoding_part = ''
            if eventual_encoding != None:
                encoding_part = ' encoding="%s"' % eventual_encoding
            prefix = '<?xml version="1.0"%s?>\n' % encoding_part
        else:
            prefix = ''
        if not pretty_print:
            indent_level = None
        else:
            indent_level = 0
        return prefix + super(BeautifulSoup, self).decode(
            indent_level, eventual_encoding, formatter)

# Alias to make it easier to type import: 'from bs4 import _soup'
_s = BeautifulSoup
_soup = BeautifulSoup
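
The whitespace rule that endData() applies in the listing above can be tried out in isolation. The following is a minimal sketch, not part of bs4: `collapse_whitespace` and its `preserve` flag are hypothetical names standing in for the method body and for a non-empty `preserve_whitespace_tag_stack`.

```python
# Same characters as BeautifulSoup.ASCII_SPACES in the listing above.
ASCII_SPACES = "\x20\x0a\x09\x0c\x0d"


def collapse_whitespace(text, preserve=False):
    """Hypothetical helper mirroring the whitespace rule in endData().

    A string made only of ASCII whitespace collapses to a single newline
    (if it contained one) or a single space. Any other string is returned
    unchanged, as is any string inside a whitespace-preserving tag.
    """
    if preserve:
        return text
    if all(ch in ASCII_SPACES for ch in text):
        return "\n" if "\n" in text else " "
    return text


print(repr(collapse_whitespace("   \t ")))      # whitespace only, no newline
print(repr(collapse_whitespace("  \n   ")))     # whitespace only, with newline
print(repr(collapse_whitespace("  text  ")))    # real text is left alone
```

Strings that contain any non-whitespace character are never trimmed here, which matches the listing: only all-whitespace runs between tags are collapsed.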

class BeautifulStoneSoup(BeautifulSoup):
    """Deprecated interface to an XML parser."""

    def __init__(self, *args, **kwargs):
        kwargs['features'] = 'xml'
        warnings.warn(
            'The BeautifulStoneSoup class is deprecated. Instead of using '
            'it, pass features="xml" into the BeautifulSoup constructor.')
        super(BeautifulStoneSoup, self).__init__(*args, **kwargs)

class StopParsing(Exception):
    pass

class FeatureNotFound(ValueError):
    pass

#By default, act as an HTML pretty-printer.
if __name__ == '__main__':
    import sys
    soup = BeautifulSoup(sys.stdin)
    print(soup.prettify())
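
The BeautifulSoup class docstring says a tree can be built from nothing more than start-tag, end-tag, data, and "done with data" events. The sketch below illustrates that idea using only the standard library. It is not bs4 code: `Node` and `TinyTreeBuilder` are hypothetical names, and the end-tag handling copies the pop-until-match loop of `_popToTag` from the listing above.

```python
from html.parser import HTMLParser


class Node:
    """Hypothetical minimal tree node (not a bs4 class)."""

    def __init__(self, name):
        self.name = name
        self.contents = []  # child Node objects and plain strings


class TinyTreeBuilder(HTMLParser):
    """Builds a tree from parser events with a tag stack."""

    ROOT = "[document]"

    def __init__(self):
        super().__init__()
        self.root = Node(self.ROOT)
        self.stack = [self.root]  # the root is never popped
        self.data = []            # pieces of the current text node

    def end_data(self):
        # Close the current text node and attach it to the open tag.
        if self.data:
            self.stack[-1].contents.append("".join(self.data))
            self.data = []

    def handle_starttag(self, name, attrs):
        self.end_data()
        node = Node(name)
        self.stack[-1].contents.append(node)
        self.stack.append(node)

    def handle_endtag(self, name):
        self.end_data()
        # Pop up to and including the most recent tag called `name`.
        for _ in range(len(self.stack) - 1):
            if self.stack.pop().name == name:
                break

    def handle_data(self, data):
        self.data.append(data)

    def close(self):
        super().close()
        self.end_data()
        del self.stack[1:]  # close any tags that are still open


builder = TinyTreeBuilder()
builder.feed("<p>Hello <b>world</b></p>")
builder.close()
paragraph = builder.root.contents[0]
print(paragraph.name, paragraph.contents[0], paragraph.contents[1].name)
```

The sketch leaves out everything else the real class does (SoupStrainer filtering, whitespace handling, sibling and element links), so it only shows the stack discipline.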

find_all() source code:

# Use of this source code is governed by a BSD-style license that can be
# found in the LICENSE file.
__license__ = "MIT"

try:
    from collections.abc import Callable # Python 3.6
except ImportError as e:
    from collections import Callable
import re
import shlex
import sys
import warnings
from bs4.dammit import EntitySubstitution

DEFAULT_OUTPUT_ENCODING = "utf-8"
PY3K = (sys.version_info[0] > 2)

whitespace_re = re.compile(r"\s+")

def _alias(attr):
    """Alias one attribute name to another for backward compatibility"""
    @property
    def alias(self):
        return getattr(self, attr)

    @alias.setter
    def alias(self, value):
        # A property setter receives the new value; forward it to the
        # aliased attribute.
        return setattr(self, attr, value)
    return alias
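
As a quick illustration of what `_alias` is for, here is a minimal standalone sketch of an aliasing property that forwards both reads and writes. `make_alias` and `Element` are hypothetical names used only for this example, and the sketch assumes the setter is meant to pass the assigned value through to the aliased attribute.

```python
def make_alias(attr):
    """Return a property that forwards reads and writes to `attr`."""

    def getter(self):
        return getattr(self, attr)

    def setter(self, value):
        setattr(self, attr, value)

    return property(getter, setter)


class Element:
    """Hypothetical class, used only to exercise the alias."""

    def __init__(self):
        self.next_sibling = None

    # Old camelCase name kept as an alias for the new snake_case name.
    nextSibling = make_alias("next_sibling")


element = Element()
element.next_sibling = "first"
print(element.nextSibling)   # read through the alias
element.nextSibling = "second"
print(element.next_sibling)  # the write went to the real attribute
```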

class NamespacedAttribute(str):

    def __new__(cls, prefix, name, namespace=None):
        if name is None:
            obj = str.__new__(cls, prefix)
        elif prefix is None:
            # Not really namespaced.
            obj = str.__new__(cls, name)
        else:
            obj = str.__new__(cls, prefix + ":" + name)
        obj.prefix = prefix
        obj.name = name
        obj.namespace = namespace
        return obj

class AttributeValueWithCharsetSubstitution(str):
    """A stand-in object for a character encoding specified in HTML."""

class CharsetMetaAttributeValue(AttributeValueWithCharsetSubstitution):
    """A generic stand-in for the value of a meta tag's 'charset' attribute.

    When Beautiful Soup parses the markup '<meta charset="utf8">', the
    value of the 'charset' attribute will be one of these objects.
    """

    def __new__(cls, original_value):
        obj = str.__new__(cls, original_value)
        obj.original_value = original_value
        return obj

    def encode(self, encoding):
        return encoding

class ContentMetaAttributeValue(AttributeValueWithCharsetSubstitution):
    """A generic stand-in for the value of a meta tag's 'content' attribute.

    When Beautiful Soup parses the markup:
     <meta http-equiv="content-type" content="text/html; charset=utf8">

    The value of the 'content' attribute will be one of these objects.
    """

    CHARSET_RE = re.compile(r"((^|;)\s*charset=)([^;]*)", re.M)

    def __new__(cls, original_value):
        match = cls.CHARSET_RE.search(original_value)
        if match is None:
            # No substitution necessary.
            return str.__new__(str, original_value)

        obj = str.__new__(cls, original_value)
        obj.original_value = original_value
        return obj

    def encode(self, encoding):
        def rewrite(match):
            return match.group(1) + encoding
        return self.CHARSET_RE.sub(rewrite, self.original_value)
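
The substitution performed by ContentMetaAttributeValue.encode() can be tried without bs4. The sketch below reuses the same regular expression; `rewrite_charset` is a hypothetical helper name, not a library function.

```python
import re

# Same pattern as ContentMetaAttributeValue.CHARSET_RE in the listing above.
CHARSET_RE = re.compile(r"((^|;)\s*charset=)([^;]*)", re.M)


def rewrite_charset(content, encoding):
    """Hypothetical helper: swap the charset value, keep everything else."""
    # group(1) is the "; charset=" prefix; the old value (group 3) is dropped.
    return CHARSET_RE.sub(lambda match: match.group(1) + encoding, content)


print(rewrite_charset("text/html; charset=utf8", "iso-8859-1"))
print(rewrite_charset("text/html", "iso-8859-1"))  # no charset, unchanged
```

This is why the meta tag stays consistent when a document is re-encoded on output: only the charset value changes, and content values without a charset are left alone.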

class HTMLAwareEntitySubstitution(EntitySubstitution):

    """Entity substitution rules that are aware of some HTML quirks.

    Specifically, the contents of <script> and <style> tags should not
    undergo entity substitution.

    Incoming NavigableString objects are checked to see if they're the
    direct children of a <script> or <style> tag.
    """

    cdata_containing_tags = set(["script", "style"])

    preformatted_tags = set(["pre"])

    preserve_whitespace_tags = set(['pre', 'textarea'])

    @classmethod
    def _substitute_if_appropriate(cls, ns, f):
        if (isinstance(ns, NavigableString)
            and ns.parent is not None
            and ns.parent.name in cls.cdata_containing_tags):
            # Do nothing.
            return ns
        # Substitute.
        return f(ns)

    @classmethod
    def substitute_html(cls, ns):
        return cls._substitute_if_appropriate(
            ns, EntitySubstitution.substitute_html)

    @classmethod
    def substitute_xml(cls, ns):
        return cls._substitute_if_appropriate(
            ns, EntitySubstitution.substitute_xml)

class Formatter(object):
    """Contains information about how to format a parse tree."""

    # By default, represent void elements as <tag/> rather than <tag>
    void_element_close_prefix = '/'

    def substitute_entities(self, *args, **kwargs):
        """Transform certain characters into named entities."""
        raise NotImplementedError()

class HTMLFormatter(Formatter):
    """The default HTML formatter."""
    def substitute(self, *args, **kwargs):
        return HTMLAwareEntitySubstitution.substitute_html(*args, **kwargs)

class MinimalHTMLFormatter(Formatter):
    """A minimal HTML formatter."""
    def substitute(self, *args, **kwargs):
        return HTMLAwareEntitySubstitution.substitute_xml(*args, **kwargs)

class HTML5Formatter(HTMLFormatter):
    """An HTML formatter that omits the slash in a void tag."""
    void_element_close_prefix = None

class XMLFormatter(Formatter):
    """Substitute only the essential XML entities."""
    def substitute(self, *args, **kwargs):
        return EntitySubstitution.substitute_xml(*args, **kwargs)

class HTMLXMLFormatter(Formatter):
    """Format XML using HTML rules."""
    def substitute(self, *args, **kwargs):
        return HTMLAwareEntitySubstitution.substitute_html(*args, **kwargs)

class PageElement(object):
    """Contains the navigational information for some part of the page
    (either a tag or a piece of text)"""

    # There are five possible values for the "formatter" argument passed in
    # to methods like encode() and prettify():
    #
    # "html" - All Unicode characters with corresponding HTML entities
    #   are converted to those entities on output.
    # "html5" - The same as "html", but empty void tags are represented as
    #   <tag> rather than <tag/>
    # "minimal" - Bare ampersands and angle brackets are converted to
    #   XML entities: &amp; &lt; &gt;
    # None - The null formatter. Unicode characters are never
    #   converted to entities. This is not recommended, but it's
    #   faster than "minimal".
    # A callable function - it will be called on every string that needs to undergo entity substitution.
    # A Formatter instance - Formatter.substitute(string) will be called on every string that
    #   needs to undergo entity substitution.
    #

    # In an HTML document, the default "html", "html5", and "minimal"
    # functions will leave the contents of <script> and <style> tags
    # alone. For an XML document, all tags will be given the same
    # treatment.

    HTML_FORMATTERS = {
        "html" : HTMLFormatter(),
        "html5" : HTML5Formatter(),
        "minimal" : MinimalHTMLFormatter(),
        None : None
    }

    XML_FORMATTERS = {
        "html" : HTMLXMLFormatter(),
        "minimal" : XMLFormatter(),
        None : None
    }

    def format_string(self, s, formatter='minimal'):
        """Format the given string using the given formatter."""
        if isinstance(formatter, str):
            formatter = self._formatter_for_name(formatter)
        if formatter is None:
            output = s
        else:
            if callable(formatter):
                # Backwards compatibility -- you used to pass in a formatting method.
                output = formatter(s)
            else:
                output = formatter.substitute(s)
        return output

    @property
    def _is_xml(self):
        """Is this element part of an XML tree or an HTML tree?

        This is used when mapping a formatter name ("minimal") to an
        appropriate function (one that performs entity-substitution on
        the contents of <script> and <style> tags, or not). It can be
        inefficient, but it should be called very rarely.
        """
        if self.known_xml is not None:
            # Most of the time we will have determined this when the
            # document is parsed.
            return self.known_xml

        # Otherwise, it's likely that this element was created by
        # direct invocation of the constructor from within the user's
        # Python code.
        if self.parent is None:
            # This is the top-level object. It should have .known_xml set
            # from tree creation. If not, take a guess--BS is usually
            # used on HTML markup.
            return getattr(self, 'is_xml', False)
        return self.parent._is_xml

    def _formatter_for_name(self, name):
        "Look up a formatter function based on its name and the tree."
        if self._is_xml:
            return self.XML_FORMATTERS.get(name, XMLFormatter())
        else:
            return self.HTML_FORMATTERS.get(name, HTMLFormatter())

    def setup(self, parent=None, previous_element=None, next_element=None,
              previous_sibling=None, next_sibling=None):
        """Sets up the initial relations between this element and
        other elements."""
        self.parent = parent

        self.previous_element = previous_element
        if previous_element is not None:
            self.previous_element.next_element = self

        self.next_element = next_element
        if self.next_element:
            self.next_element.previous_element = self

        self.next_sibling = next_sibling
        if self.next_sibling:
            self.next_sibling.previous_sibling = self

        if (not previous_sibling
            and self.parent is not None and self.parent.contents):
            previous_sibling = self.parent.contents[-1]

        self.previous_sibling = previous_sibling
        if previous_sibling:
            self.previous_sibling.next_sibling = self

    nextSibling = _alias("next_sibling")  # BS3
    previousSibling = _alias("previous_sibling")  # BS3

  274. def replace_with(self, replace_with):
  275. if not self.parent:
  276. raise ValueError(
  277. "Cannot replace one element with another when the"
  278. "element to be replaced is not part of a tree.")
  279. if replace_with is self:
  280. return
  281. if replace_with is self.parent:
  282. raise ValueError("Cannot replace a Tag with its parent.")
  283. old_parent = self.parent
  284. my_index = self.parent.index(self)
  285. self.extract()
  286. old_parent.insert(my_index, replace_with)
  287. return self
  288. replaceWith = replace_with # BS3
  289.  
  290. def unwrap(self):
  291. my_parent = self.parent
  292. if not self.parent:
  293. raise ValueError(
  294. "Cannot replace an element with its contents when that "
  295. "element is not part of a tree.")
  296. my_index = self.parent.index(self)
  297. self.extract()
  298. for child in reversed(self.contents[:]):
  299. my_parent.insert(my_index, child)
  300. return self
  301. replace_with_children = unwrap
  302. replaceWithChildren = unwrap # BS3
  303.  
  304. def wrap(self, wrap_inside):
  305. me = self.replace_with(wrap_inside)
  306. wrap_inside.append(me)
  307. return wrap_inside
  308.  
  309. def extract(self):
  310. """Destructively rips this element out of the tree."""
  311. if self.parent is not None:
  312. del self.parent.contents[self.parent.index(self)]
  313.  
  314. #Find the two elements that would be next to each other if
  315. #this element (and any children) hadn't been parsed. Connect
  316. #the two.
  317. last_child = self._last_descendant()
  318. next_element = last_child.next_element
  319.  
  320. if (self.previous_element is not None and
  321. self.previous_element is not next_element):
  322. self.previous_element.next_element = next_element
  323. if next_element is not None and next_element is not self.previous_element:
  324. next_element.previous_element = self.previous_element
  325. self.previous_element = None
  326. last_child.next_element = None
  327.  
  328. self.parent = None
  329. if (self.previous_sibling is not None
  330. and self.previous_sibling is not self.next_sibling):
  331. self.previous_sibling.next_sibling = self.next_sibling
  332. if (self.next_sibling is not None
  333. and self.next_sibling is not self.previous_sibling):
  334. self.next_sibling.previous_sibling = self.previous_sibling
  335. self.previous_sibling = self.next_sibling = None
  336. return self
  337.  
  338. def _last_descendant(self, is_initialized=True, accept_self=True):
  339. "Finds the last element beneath this object to be parsed."
  340. if is_initialized and self.next_sibling:
  341. last_child = self.next_sibling.previous_element
  342. else:
  343. last_child = self
  344. while isinstance(last_child, Tag) and last_child.contents:
  345. last_child = last_child.contents[-1]
  346. if not accept_self and last_child is self:
  347. last_child = None
  348. return last_child
  349. # BS3: Not part of the API!
  350. _lastRecursiveChild = _last_descendant
  351.  
  352. def insert(self, position, new_child):
  353. if new_child is None:
  354. raise ValueError("Cannot insert None into a tag.")
  355. if new_child is self:
  356. raise ValueError("Cannot insert a tag into itself.")
  357. if (isinstance(new_child, str)
  358. and not isinstance(new_child, NavigableString)):
  359. new_child = NavigableString(new_child)
  360.  
  361. from bs4 import BeautifulSoup
  362. if isinstance(new_child, BeautifulSoup):
  363. # We don't want to end up with a situation where one BeautifulSoup
  364. # object contains another. Insert the children one at a time.
  365. for subchild in list(new_child.contents):
  366. self.insert(position, subchild)
  367. position += 1
  368. return
  369. position = min(position, len(self.contents))
  370. if hasattr(new_child, 'parent') and new_child.parent is not None:
  371. # We're 'inserting' an element that's already one
  372. # of this object's children.
  373. if new_child.parent is self:
  374. current_index = self.index(new_child)
  375. if current_index < position:
  376. # We're moving this element further down the list
  377. # of this object's children. That means that when
  378. # we extract this element, our target index will
  379. # jump down one.
  380. position -= 1
  381. new_child.extract()
  382.  
  383. new_child.parent = self
  384. previous_child = None
  385. if position == 0:
  386. new_child.previous_sibling = None
  387. new_child.previous_element = self
  388. else:
  389. previous_child = self.contents[position - 1]
  390. new_child.previous_sibling = previous_child
  391. new_child.previous_sibling.next_sibling = new_child
  392. new_child.previous_element = previous_child._last_descendant(False)
  393. if new_child.previous_element is not None:
  394. new_child.previous_element.next_element = new_child
  395.  
  396. new_childs_last_element = new_child._last_descendant(False)
  397.  
  398. if position >= len(self.contents):
  399. new_child.next_sibling = None
  400.  
  401. parent = self
  402. parents_next_sibling = None
  403. while parents_next_sibling is None and parent is not None:
  404. parents_next_sibling = parent.next_sibling
  405. parent = parent.parent
  406. if parents_next_sibling is not None:
  407. # We found the element that comes next in the document.
  408. break
  409. if parents_next_sibling is not None:
  410. new_childs_last_element.next_element = parents_next_sibling
  411. else:
  412. # The last element of this tag is the last element in
  413. # the document.
  414. new_childs_last_element.next_element = None
  415. else:
  416. next_child = self.contents[position]
  417. new_child.next_sibling = next_child
  418. if new_child.next_sibling is not None:
  419. new_child.next_sibling.previous_sibling = new_child
  420. new_childs_last_element.next_element = next_child
  421.  
  422. if new_childs_last_element.next_element is not None:
  423. new_childs_last_element.next_element.previous_element = new_childs_last_element
  424. self.contents.insert(position, new_child)
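
The index adjustment in insert() is easiest to see on a plain list. The following is a minimal sketch, not the library's code: `move_within` is a hypothetical helper, and a Python list stands in for `self.contents`. It shows why the target position is decremented when an existing child moves further down among its siblings.

```python
def move_within(contents, item, position):
    """Move `item` (already in `contents`) so it ends up just before the
    element that was at `position` when the call was made."""
    position = min(position, len(contents))
    # Locate the item by identity, as Tag.index() does.
    current_index = next(i for i, c in enumerate(contents) if c is item)
    if current_index < position:
        # Removing the item shifts every later element one slot to the
        # left, so the target index shifts with them.
        position -= 1
    del contents[current_index]
    contents.insert(position, item)
    return contents


children = ['a', 'b', 'c', 'd']
move_within(children, children[0], 3)  # move 'a' to just before 'd'
print(children)  # ['b', 'c', 'a', 'd']
```

Without the decrement, 'a' would land after 'd' instead of before it.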
  425.  
  426. def append(self, tag):
  427. """Appends the given tag to the contents of this tag."""
  428. self.insert(len(self.contents), tag)
  429.  
  430. def insert_before(self, predecessor):
  431. """Makes the given element the immediate predecessor of this one.
  432.  
  433. The two elements will have the same parent, and the given element
  434. will be immediately before this one.
  435. """
  436. if self is predecessor:
  437. raise ValueError("Can't insert an element before itself.")
  438. parent = self.parent
  439. if parent is None:
  440. raise ValueError(
  441. "Element has no parent, so 'before' has no meaning.")
  442. # Extract first so that the index won't be screwed up if they
  443. # are siblings.
  444. if isinstance(predecessor, PageElement):
  445. predecessor.extract()
  446. index = parent.index(self)
  447. parent.insert(index, predecessor)
  448.  
  449. def insert_after(self, successor):
  450. """Makes the given element the immediate successor of this one.
  451.  
  452. The two elements will have the same parent, and the given element
  453. will be immediately after this one.
  454. """
  455. if self is successor:
  456. raise ValueError("Can't insert an element after itself.")
  457. parent = self.parent
  458. if parent is None:
  459. raise ValueError(
  460. "Element has no parent, so 'after' has no meaning.")
  461. # Extract first so that the index won't be screwed up if they
  462. # are siblings.
  463. if isinstance(successor, PageElement):
  464. successor.extract()
  465. index = parent.index(self)
  466. parent.insert(index+1, successor)
  467.  
  468. def find_next(self, name=None, attrs={}, text=None, **kwargs):
  469. """Returns the first item that matches the given criteria and
  470. appears after this Tag in the document."""
  471. return self._find_one(self.find_all_next, name, attrs, text, **kwargs)
  472. findNext = find_next # BS3
  473.  
  474. def find_all_next(self, name=None, attrs={}, text=None, limit=None,
  475. **kwargs):
  476. """Returns all items that match the given criteria and appear
  477. after this Tag in the document."""
  478. return self._find_all(name, attrs, text, limit, self.next_elements,
  479. **kwargs)
  480. findAllNext = find_all_next # BS3
  481.  
  482. def find_next_sibling(self, name=None, attrs={}, text=None, **kwargs):
  483. """Returns the closest sibling to this Tag that matches the
  484. given criteria and appears after this Tag in the document."""
  485. return self._find_one(self.find_next_siblings, name, attrs, text,
  486. **kwargs)
  487. findNextSibling = find_next_sibling # BS3
  488.  
  489. def find_next_siblings(self, name=None, attrs={}, text=None, limit=None,
  490. **kwargs):
  491. """Returns the siblings of this Tag that match the given
  492. criteria and appear after this Tag in the document."""
  493. return self._find_all(name, attrs, text, limit,
  494. self.next_siblings, **kwargs)
  495. findNextSiblings = find_next_siblings # BS3
  496. fetchNextSiblings = find_next_siblings # BS2
  497.  
  498. def find_previous(self, name=None, attrs={}, text=None, **kwargs):
  499. """Returns the first item that matches the given criteria and
  500. appears before this Tag in the document."""
  501. return self._find_one(
  502. self.find_all_previous, name, attrs, text, **kwargs)
  503. findPrevious = find_previous # BS3
  504.  
  505. def find_all_previous(self, name=None, attrs={}, text=None, limit=None,
  506. **kwargs):
  507. """Returns all items that match the given criteria and appear
  508. before this Tag in the document."""
  509. return self._find_all(name, attrs, text, limit, self.previous_elements,
  510. **kwargs)
  511. findAllPrevious = find_all_previous # BS3
  512. fetchPrevious = find_all_previous # BS2
  513.  
  514. def find_previous_sibling(self, name=None, attrs={}, text=None, **kwargs):
  515. """Returns the closest sibling to this Tag that matches the
  516. given criteria and appears before this Tag in the document."""
  517. return self._find_one(self.find_previous_siblings, name, attrs, text,
  518. **kwargs)
  519. findPreviousSibling = find_previous_sibling # BS3
  520.  
  521. def find_previous_siblings(self, name=None, attrs={}, text=None,
  522. limit=None, **kwargs):
  523. """Returns the siblings of this Tag that match the given
  524. criteria and appear before this Tag in the document."""
  525. return self._find_all(name, attrs, text, limit,
  526. self.previous_siblings, **kwargs)
  527. findPreviousSiblings = find_previous_siblings # BS3
  528. fetchPreviousSiblings = find_previous_siblings # BS2
  529.  
  530. def find_parent(self, name=None, attrs={}, **kwargs):
  531. """Returns the closest parent of this Tag that matches the given
  532. criteria."""
  533. # NOTE: We can't use _find_one because findParents takes a different
  534. # set of arguments.
  535. r = None
  536. l = self.find_parents(name, attrs, 1, **kwargs)
  537. if l:
  538. r = l[0]
  539. return r
  540. findParent = find_parent # BS3
  541.  
  542. def find_parents(self, name=None, attrs={}, limit=None, **kwargs):
  543. """Returns the parents of this Tag that match the given
  544. criteria."""
  545.  
  546. return self._find_all(name, attrs, None, limit, self.parents,
  547. **kwargs)
  548. findParents = find_parents # BS3
  549. fetchParents = find_parents # BS2
  550.  
  551. @property
  552. def next(self):
  553. return self.next_element
  554.  
  555. @property
  556. def previous(self):
  557. return self.previous_element
  558.  
  559. #These methods do the real heavy lifting.
  560.  
  561. def _find_one(self, method, name, attrs, text, **kwargs):
  562. r = None
  563. l = method(name, attrs, text, 1, **kwargs)
  564. if l:
  565. r = l[0]
  566. return r
  567.  
  568. def _find_all(self, name, attrs, text, limit, generator, **kwargs):
  569. "Iterates over a generator looking for things that match."
  570.  
  571. if text is None and 'string' in kwargs:
  572. text = kwargs['string']
  573. del kwargs['string']
  574.  
  575. if isinstance(name, SoupStrainer):
  576. strainer = name
  577. else:
  578. strainer = SoupStrainer(name, attrs, text, **kwargs)
  579.  
  580. if text is None and not limit and not attrs and not kwargs:
  581. if name is True or name is None:
  582. # Optimization to find all tags.
  583. result = (element for element in generator
  584. if isinstance(element, Tag))
  585. return ResultSet(strainer, result)
  586. elif isinstance(name, str):
  587. # Optimization to find all tags with a given name.
  588. if name.count(':') == 1:
  589. # This is a name with a prefix. If this is a namespace-aware document,
  590. # we need to match the local name against tag.name. If not,
  591. # we need to match the fully-qualified name against tag.name.
  592. prefix, local_name = name.split(':', 1)
  593. else:
  594. prefix = None
  595. local_name = name
  596. result = (element for element in generator
  597. if isinstance(element, Tag)
  598. and (
  599. element.name == name
  600. ) or (
  601. element.name == local_name
  602. and (prefix is None or element.prefix == prefix)
  603. )
  604. )
  605. return ResultSet(strainer, result)
  606. results = ResultSet(strainer)
  607. while True:
  608. try:
  609. i = next(generator)
  610. except StopIteration:
  611. break
  612. if i:
  613. found = strainer.search(i)
  614. if found:
  615. results.append(found)
  616. if limit and len(results) >= limit:
  617. break
  618. return results
  619.  
  620. #These generators can be used to navigate starting from both
  621. #NavigableStrings and Tags.
  622. @property
  623. def next_elements(self):
  624. i = self.next_element
  625. while i is not None:
  626. yield i
  627. i = i.next_element
  628.  
  629. @property
  630. def next_siblings(self):
  631. i = self.next_sibling
  632. while i is not None:
  633. yield i
  634. i = i.next_sibling
  635.  
  636. @property
  637. def previous_elements(self):
  638. i = self.previous_element
  639. while i is not None:
  640. yield i
  641. i = i.previous_element
  642.  
  643. @property
  644. def previous_siblings(self):
  645. i = self.previous_sibling
  646. while i is not None:
  647. yield i
  648. i = i.previous_sibling
  649.  
  650. @property
  651. def parents(self):
  652. i = self.parent
  653. while i is not None:
  654. yield i
  655. i = i.parent
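
All five generators above follow the same pattern: start at one link and keep following it until it is None. A minimal sketch of that pattern, assuming a hypothetical `Node` class that carries only a `next_element` link:

```python
class Node:
    """Hypothetical stand-in for a PageElement with one link."""

    def __init__(self, label):
        self.label = label
        self.next_element = None

    @property
    def next_elements(self):
        # Yield every node after this one, in document order.
        i = self.next_element
        while i is not None:
            yield i
            i = i.next_element


a, b, c = Node('a'), Node('b'), Node('c')
a.next_element = b
b.next_element = c

labels = [node.label for node in a.next_elements]
print(labels)  # ['b', 'c']
```

The starting node itself is never yielded, which is why find_next() and friends only return elements that come after the current one.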
  656.  
  657. # Methods for supporting CSS selectors.
  658.  
  659. tag_name_re = re.compile('^[a-zA-Z0-9][-.a-zA-Z0-9:_]*$')
  660.  
  661. # /^([a-zA-Z0-9][-.a-zA-Z0-9:_]*)\[(\w+)([=~\|\^\$\*]?)=?"?([^\]"]*)"?\]$/
  662. # The four groups, from left to right:
  663. # 1. Tag
  664. # 2. Attribute
  665. # 3. ~,|,^,$,* or =
  666. # 4. The value
  667. #
  668. attribselect_re = re.compile(
  669. r'^(?P<tag>[a-zA-Z0-9][-.a-zA-Z0-9:_]*)?\[(?P<attribute>[\w-]+)(?P<operator>[=~\|\^\$\*]?)' +
  670. r'=?"?(?P<value>[^\]"]*)"?\]$'
  671. )
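
To see what attribselect_re captures, the pattern can be run on its own. This sketch copies the pattern from the listing so it runs standalone; `split_selector` is a hypothetical helper name, not part of the library.

```python
import re

# Same pattern as attribselect_re in the listing.
ATTRIBSELECT_RE = re.compile(
    r'^(?P<tag>[a-zA-Z0-9][-.a-zA-Z0-9:_]*)?\[(?P<attribute>[\w-]+)(?P<operator>[=~\|\^\$\*]?)'
    r'=?"?(?P<value>[^\]"]*)"?\]$'
)


def split_selector(token):
    """Return (tag, attribute, operator, value), or None if no match."""
    match = ATTRIBSELECT_RE.match(token)
    if match is None:
        return None
    return match.group('tag', 'attribute', 'operator', 'value')


print(split_selector('a[href^="http"]'))  # ('a', 'href', '^', 'http')
print(split_selector('[data-id=5]'))      # (None, 'data-id', '=', '5')
print(split_selector('a[href]'))          # ('a', 'href', '', '')
```

The tag group is optional, so a bare `[attr]` selector yields None for the tag, and a selector without an operator yields an empty operator string.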
  672.  
  673. def _attr_value_as_string(self, value, default=None):
  674. """Force an attribute value into a string representation.
  675.  
  676. A multi-valued attribute will be converted into a
  677. space-separated string.
  678. """
  679. value = self.get(value, default)
  680. if isinstance(value, list) or isinstance(value, tuple):
  681. value = " ".join(value)
  682. return value
  683.  
  684. def _tag_name_matches_and(self, function, tag_name):
  685. if not tag_name:
  686. return function
  687. else:
  688. def _match(tag):
  689. return tag.name == tag_name and function(tag)
  690. return _match
  691.  
  692. def _attribute_checker(self, operator, attribute, value=''):
  693. """Create a function that performs a CSS selector operation.
  694.  
  695. Takes an operator, attribute and optional value. Returns a
  696. function that will return True for elements that match that
  697. combination.
  698. """
  699. if operator == '=':
  700. # string representation of `attribute` is equal to `value`
  701. return lambda el: el._attr_value_as_string(attribute) == value
  702. elif operator == '~':
  703. # space-separated list representation of `attribute`
  704. # contains `value`
  705. def _includes_value(element):
  706. attribute_value = element.get(attribute, [])
  707. if not isinstance(attribute_value, list):
  708. attribute_value = attribute_value.split()
  709. return value in attribute_value
  710. return _includes_value
  711. elif operator == '^':
  712. # string representation of `attribute` starts with `value`
  713. return lambda el: el._attr_value_as_string(
  714. attribute, '').startswith(value)
  715. elif operator == '$':
  716. # string representation of `attribute` ends with `value`
  717. return lambda el: el._attr_value_as_string(
  718. attribute, '').endswith(value)
  719. elif operator == '*':
  720. # string representation of `attribute` contains `value`
  721. return lambda el: value in el._attr_value_as_string(attribute, '')
  722. elif operator == '|':
  723. # string representation of `attribute` is either exactly
  724. # `value` or starts with `value` and then a dash.
  725. def _is_or_starts_with_dash(element):
  726. attribute_value = element._attr_value_as_string(attribute, '')
  727. return (attribute_value == value or attribute_value.startswith(
  728. value + '-'))
  729. return _is_or_starts_with_dash
  730. else:
  731. return lambda el: el.has_attr(attribute)
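
The operator table in _attribute_checker can be tried out without a parse tree. This is a rough sketch under one assumption: plain dicts of attributes stand in for Tag objects, and `as_string` and `attribute_checker` are hypothetical module-level names mirroring the methods above.

```python
def as_string(attrs, attribute, default=None):
    """Join multi-valued attributes with spaces, like _attr_value_as_string."""
    raw = attrs.get(attribute, default)
    if isinstance(raw, (list, tuple)):
        raw = ' '.join(raw)
    return raw


def attribute_checker(operator, attribute, value=''):
    """Return a predicate over an attribute dict for one CSS operator."""
    if operator == '=':
        return lambda attrs: as_string(attrs, attribute) == value
    if operator == '~':
        def includes_value(attrs):
            raw = attrs.get(attribute, [])
            if not isinstance(raw, list):
                raw = raw.split()
            return value in raw
        return includes_value
    if operator == '^':
        return lambda attrs: as_string(attrs, attribute, '').startswith(value)
    if operator == '$':
        return lambda attrs: as_string(attrs, attribute, '').endswith(value)
    if operator == '*':
        return lambda attrs: value in as_string(attrs, attribute, '')
    if operator == '|':
        def is_or_starts_with_dash(attrs):
            text = as_string(attrs, attribute, '')
            return text == value or text.startswith(value + '-')
        return is_or_starts_with_dash
    # No operator: only test that the attribute is present.
    return lambda attrs: attribute in attrs


link = {'href': 'http://example.com/a.pdf',
        'class': ['btn', 'primary'],
        'lang': 'en-US'}

print(attribute_checker('^', 'href', 'http')(link))      # True
print(attribute_checker('~', 'class', 'primary')(link))  # True
print(attribute_checker('|', 'lang', 'en')(link))        # True
print(attribute_checker('=', 'lang', 'en')(link))        # False
```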
  732.  
  733. # Old non-property versions of the generators, for backwards
  734. # compatibility with BS3.
  735. def nextGenerator(self):
  736. return self.next_elements
  737.  
  738. def nextSiblingGenerator(self):
  739. return self.next_siblings
  740.  
  741. def previousGenerator(self):
  742. return self.previous_elements
  743.  
  744. def previousSiblingGenerator(self):
  745. return self.previous_siblings
  746.  
  747. def parentGenerator(self):
  748. return self.parents
  749.  
  750. class NavigableString(str, PageElement):
  751.  
  752. PREFIX = ''
  753. SUFFIX = ''
  754.  
  755. # We can't tell just by looking at a string whether it's contained
  756. # in an XML document or an HTML document.
  757.  
  758. known_xml = None
  759.  
  760. def __new__(cls, value):
  761. """Create a new NavigableString.
  762.  
  763. When unpickling a NavigableString, this method is called with
  764. the string in DEFAULT_OUTPUT_ENCODING. That encoding needs to be
  765. passed in to the superclass's __new__ or the superclass won't know
  766. how to handle non-ASCII characters.
  767. """
  768. if isinstance(value, str):
  769. u = str.__new__(cls, value)
  770. else:
  771. u = str.__new__(cls, value, DEFAULT_OUTPUT_ENCODING)
  772. u.setup()
  773. return u
  774.  
  775. def __copy__(self):
  776. """A copy of a NavigableString has the same contents and class
  777. as the original, but it is not connected to the parse tree.
  778. """
  779. return type(self)(self)
  780.  
  781. def __getnewargs__(self):
  782. return (str(self),)
  783.  
  784. def __getattr__(self, attr):
  785. """text.string gives you text. This is for backwards
  786. compatibility for Navigable*String, but for CData* it lets you
  787. get the string without the CData wrapper."""
  788. if attr == 'string':
  789. return self
  790. else:
  791. raise AttributeError(
  792. "'%s' object has no attribute '%s'" % (
  793. self.__class__.__name__, attr))
  794.  
  795. def output_ready(self, formatter="minimal"):
  796. output = self.format_string(self, formatter)
  797. return self.PREFIX + output + self.SUFFIX
  798.  
  799. @property
  800. def name(self):
  801. return None
  802.  
  803. @name.setter
  804. def name(self, name):
  805. raise AttributeError("A NavigableString cannot be given a name.")
  806.  
  807. class PreformattedString(NavigableString):
  808. """A NavigableString not subject to the normal formatting rules.
  809.  
  810. The string will be passed into the formatter (to trigger side effects),
  811. but the return value will be ignored.
  812. """
  813.  
  814. def output_ready(self, formatter="minimal"):
  815. """CData strings are passed into the formatter.
  816. But the return value is ignored."""
  817. self.format_string(self, formatter)
  818. return self.PREFIX + self + self.SUFFIX
  819.  
  820. class CData(PreformattedString):
  821.  
  822. PREFIX = '<![CDATA['
  823. SUFFIX = ']]>'
  824.  
  825. class ProcessingInstruction(PreformattedString):
  826. """An SGML processing instruction."""
  827.  
  828. PREFIX = '<?'
  829. SUFFIX = '>'
  830.  
  831. class XMLProcessingInstruction(ProcessingInstruction):
  832. """An XML processing instruction."""
  833. PREFIX = '<?'
  834. SUFFIX = '?>'
  835.  
  836. class Comment(PreformattedString):
  837.  
  838. PREFIX = '<!--'
  839. SUFFIX = '-->'
  840.  
  841. class Declaration(PreformattedString):
  842. PREFIX = '<?'
  843. SUFFIX = '?>'
  844.  
  845. class Doctype(PreformattedString):
  846.  
  847. @classmethod
  848. def for_name_and_ids(cls, name, pub_id, system_id):
  849. value = name or ''
  850. if pub_id is not None:
  851. value += ' PUBLIC "%s"' % pub_id
  852. if system_id is not None:
  853. value += ' "%s"' % system_id
  854. elif system_id is not None:
  855. value += ' SYSTEM "%s"' % system_id
  856.  
  857. return Doctype(value)
  858.  
  859. PREFIX = '<!DOCTYPE '
  860. SUFFIX = '>\n'
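
The string that for_name_and_ids builds depends on which identifiers are present. A standalone sketch of the same branching, assuming (as in the upstream source, where indentation is intact) that the system_id check after PUBLIC is nested inside the pub_id branch; `doctype_value` and `render_doctype` are hypothetical names:

```python
def doctype_value(name, pub_id, system_id):
    """Build the text that goes between '<!DOCTYPE ' and '>'."""
    value = name or ''
    if pub_id is not None:
        value += ' PUBLIC "%s"' % pub_id
        if system_id is not None:
            value += ' "%s"' % system_id
    elif system_id is not None:
        value += ' SYSTEM "%s"' % system_id
    return value


def render_doctype(value):
    # PREFIX and SUFFIX as defined on the Doctype class.
    return '<!DOCTYPE ' + value + '>\n'


print(repr(render_doctype(doctype_value('html', None, None))))
# '<!DOCTYPE html>\n'
print(doctype_value('html', None, 'about:legacy-compat'))
# html SYSTEM "about:legacy-compat"
```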
  861.  
  862. class Tag(PageElement):
  863.  
  864. """Represents a found HTML tag with its attributes and contents."""
  865.  
  866. def __init__(self, parser=None, builder=None, name=None, namespace=None,
  867. prefix=None, attrs=None, parent=None, previous=None,
  868. is_xml=None):
  869. "Basic constructor."
  870.  
  871. if parser is None:
  872. self.parser_class = None
  873. else:
  874. # We don't actually store the parser object: that lets extracted
  875. # chunks be garbage-collected.
  876. self.parser_class = parser.__class__
  877. if name is None:
  878. raise ValueError("No value provided for new tag's name.")
  879. self.name = name
  880. self.namespace = namespace
  881. self.prefix = prefix
  882. if builder is not None:
  883. preserve_whitespace_tags = builder.preserve_whitespace_tags
  884. else:
  885. if is_xml:
  886. preserve_whitespace_tags = []
  887. else:
  888. preserve_whitespace_tags = HTMLAwareEntitySubstitution.preserve_whitespace_tags
  889. self.preserve_whitespace_tags = preserve_whitespace_tags
  890. if attrs is None:
  891. attrs = {}
  892. elif attrs:
  893. if builder is not None and builder.cdata_list_attributes:
  894. attrs = builder._replace_cdata_list_attribute_values(
  895. self.name, attrs)
  896. else:
  897. attrs = dict(attrs)
  898. else:
  899. attrs = dict(attrs)
  900.  
  901. # If possible, determine ahead of time whether this tag is an
  902. # XML tag.
  903. if builder:
  904. self.known_xml = builder.is_xml
  905. else:
  906. self.known_xml = is_xml
  907. self.attrs = attrs
  908. self.contents = []
  909. self.setup(parent, previous)
  910. self.hidden = False
  911.  
  912. # Set up any substitutions, such as the charset in a META tag.
  913. if builder is not None:
  914. builder.set_up_substitutions(self)
  915. self.can_be_empty_element = builder.can_be_empty_element(name)
  916. else:
  917. self.can_be_empty_element = False
  918.  
  919. parserClass = _alias("parser_class") # BS3
  920.  
  921. def __copy__(self):
  922. """A copy of a Tag is a new Tag, unconnected to the parse tree.
  923. Its contents are a copy of the old Tag's contents.
  924. """
  925. clone = type(self)(None, self.builder, self.name, self.namespace,
  926. self.prefix, self.attrs, is_xml=self._is_xml)
  927. for attr in ('can_be_empty_element', 'hidden'):
  928. setattr(clone, attr, getattr(self, attr))
  929. for child in self.contents:
  930. clone.append(child.__copy__())
  931. return clone
  932.  
  933. @property
  934. def is_empty_element(self):
  935. """Is this tag an empty-element tag? (aka a self-closing tag)
  936.  
  937. A tag that has contents is never an empty-element tag.
  938.  
  939. A tag that has no contents may or may not be an empty-element
  940. tag. It depends on the builder used to create the tag. If the
  941. builder has a designated list of empty-element tags, then only
  942. a tag whose name shows up in that list is considered an
  943. empty-element tag.
  944.  
  945. If the builder has no designated list of empty-element tags,
  946. then any tag with no contents is an empty-element tag.
  947. """
  948. return len(self.contents) == 0 and self.can_be_empty_element
  949. isSelfClosing = is_empty_element # BS3
  950.  
  951. @property
  952. def string(self):
  953. """Convenience property to get the single string within this tag.
  954.  
  955. :Return: If this tag has a single string child, return value
  956. is that string. If this tag has no children, or more than one
  957. child, return value is None. If this tag has one child tag,
  958. return value is the 'string' attribute of the child tag,
  959. recursively.
  960. """
  961. if len(self.contents) != 1:
  962. return None
  963. child = self.contents[0]
  964. if isinstance(child, NavigableString):
  965. return child
  966. return child.string
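
The recursion in the .string getter can be modelled with nested lists. In this sketch a str stands in for a NavigableString and a list stands in for a tag's contents; `single_string` is a hypothetical function, not library API.

```python
def single_string(node):
    """Return the only string under `node`, or None if there is not
    exactly one child at some level."""
    if isinstance(node, str):
        # A string's .string is the string itself.
        return node
    if len(node) != 1:
        return None
    return single_string(node[0])


print(single_string([['hello']]))  # hello
print(single_string(['a', 'b']))   # None
print(single_string([]))           # None
```

The first call corresponds to markup such as `<b><i>hello</i></b>`, where each tag has exactly one child.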
  967.  
  968. @string.setter
  969. def string(self, string):
  970. self.clear()
  971. self.append(string.__class__(string))
  972.  
  973. def _all_strings(self, strip=False, types=(NavigableString, CData)):
  974. """Yield all strings of certain classes, possibly stripping them.
  975.  
  976. By default, yields only NavigableString and CData objects. So
  977. no comments, processing instructions, etc.
  978. """
  979. for descendant in self.descendants:
  980. if (
  981. (types is None and not isinstance(descendant, NavigableString))
  982. or
  983. (types is not None and type(descendant) not in types)):
  984. continue
  985. if strip:
  986. descendant = descendant.strip()
  987. if len(descendant) == 0:
  988. continue
  989. yield descendant
  990.  
  991. strings = property(_all_strings)
  992.  
  993. @property
  994. def stripped_strings(self):
  995. for string in self._all_strings(True):
  996. yield string
  997.  
  998. def get_text(self, separator="", strip=False,
  999. types=(NavigableString, CData)):
  1000. """
  1001. Get all child strings, concatenated using the given separator.
  1002. """
  1003. return separator.join([s for s in self._all_strings(
  1004. strip, types=types)])
  1005. getText = get_text
  1006. text = property(get_text)
  1007.  
  1008. def decompose(self):
  1009. """Recursively destroys the contents of this tree."""
  1010. self.extract()
  1011. i = self
  1012. while i is not None:
  1013. next = i.next_element
  1014. i.__dict__.clear()
  1015. i.contents = []
  1016. i = next
  1017.  
  1018. def clear(self, decompose=False):
  1019. """
  1020. Extract all children. If decompose is True, decompose instead.
  1021. """
  1022. if decompose:
  1023. for element in self.contents[:]:
  1024. if isinstance(element, Tag):
  1025. element.decompose()
  1026. else:
  1027. element.extract()
  1028. else:
  1029. for element in self.contents[:]:
  1030. element.extract()
  1031.  
  1032. def index(self, element):
  1033. """
  1034. Find the index of a child by identity, not value. Avoids issues with
  1035. tag.contents.index(element) getting the index of equal elements.
  1036. """
  1037. for i, child in enumerate(self.contents):
  1038. if child is element:
  1039. return i
  1040. raise ValueError("Tag.index: element not in tag")
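
The difference between lookup by identity and lookup by value shows up as soon as two children compare equal. A small sketch, using plain lists as stand-ins for equal but distinct elements; `index_by_identity` is a hypothetical free function mirroring Tag.index():

```python
def index_by_identity(contents, element):
    """Return the position of `element` itself, not of an equal object."""
    for i, child in enumerate(contents):
        if child is element:
            return i
    raise ValueError("element not in contents")


first = ['same text']
second = ['same text']   # equal to `first`, but a different object
contents = [first, second]

print(contents.index(second))               # 0, matched by equality
print(index_by_identity(contents, second))  # 1, matched by identity
```

Because Tag defines __eq__ structurally, two tags with identical markup compare equal, so list.index() could report the wrong sibling.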
  1041.  
  1042. def get(self, key, default=None):
  1043. """Returns the value of the 'key' attribute for the tag, or
  1044. the value given for 'default' if it doesn't have that
  1045. attribute."""
  1046. return self.attrs.get(key, default)
  1047.  
  1048. def get_attribute_list(self, key, default=None):
  1049. """The same as get(), but always returns a list."""
  1050. value = self.get(key, default)
  1051. if not isinstance(value, list):
  1052. value = [value]
  1053. return value
  1054.  
  1055. def has_attr(self, key):
  1056. return key in self.attrs
  1057.  
  1058. def __hash__(self):
  1059. return str(self).__hash__()
  1060.  
  1061. def __getitem__(self, key):
  1062. """tag[key] returns the value of the 'key' attribute for the tag,
  1063. and throws an exception if it's not there."""
  1064. return self.attrs[key]
  1065.  
  1066. def __iter__(self):
  1067. "Iterating over a tag iterates over its contents."
  1068. return iter(self.contents)
  1069.  
  1070. def __len__(self):
  1071. "The length of a tag is the length of its list of contents."
  1072. return len(self.contents)
  1073.  
  1074. def __contains__(self, x):
  1075. return x in self.contents
  1076.  
  1077. def __bool__(self):
  1078. "A tag is non-None even if it has no contents."
  1079. return True
  1080.  
  1081. def __setitem__(self, key, value):
  1082. """Setting tag[key] sets the value of the 'key' attribute for the
  1083. tag."""
  1084. self.attrs[key] = value
  1085.  
  1086. def __delitem__(self, key):
  1087. "Deleting tag[key] deletes all 'key' attributes for the tag."
  1088. self.attrs.pop(key, None)
  1089.  
  1090. def __call__(self, *args, **kwargs):
  1091. """Calling a tag like a function is the same as calling its
  1092. find_all() method. E.g. tag('a') returns a list of all the A tags
  1093. found within this tag."""
  1094. return self.find_all(*args, **kwargs)
  1095.  
  1096. def __getattr__(self, tag):
  1097. #print "Getattr %s.%s" % (self.__class__, tag)
  1098. if len(tag) > 3 and tag.endswith('Tag'):
  1099. # BS3: soup.aTag -> soup.find("a")
  1100. tag_name = tag[:-3]
  1101. warnings.warn(
  1102. '.%(name)sTag is deprecated, use .find("%(name)s") instead. If you really were looking for a tag called %(name)sTag, use .find("%(name)sTag")' % dict(
  1103. name=tag_name
  1104. )
  1105. )
  1106. return self.find(tag_name)
  1107. # We special case contents to avoid recursion.
  1108. elif not tag.startswith("__") and not tag == "contents":
  1109. return self.find(tag)
  1110. raise AttributeError(
  1111. "'%s' object has no attribute '%s'" % (self.__class__, tag))
  1112.  
  1113. def __eq__(self, other):
  1114. """Returns true iff this tag has the same name, the same attributes,
  1115. and the same contents (recursively) as the given tag."""
  1116. if self is other:
  1117. return True
  1118. if (not hasattr(other, 'name') or
  1119. not hasattr(other, 'attrs') or
  1120. not hasattr(other, 'contents') or
  1121. self.name != other.name or
  1122. self.attrs != other.attrs or
  1123. len(self) != len(other)):
  1124. return False
  1125. for i, my_child in enumerate(self.contents):
  1126. if my_child != other.contents[i]:
  1127. return False
  1128. return True
  1129.  
  1130. def __ne__(self, other):
  1131. """Returns true iff this tag is not identical to the other tag,
  1132. as defined in __eq__."""
  1133. return not self == other
  1134.  
  1135. def __repr__(self, encoding="unicode-escape"):
  1136. """Renders this tag as a string."""
  1137. if PY3K:
  1138. # "The return value must be a string object", i.e. Unicode
  1139. return self.decode()
  1140. else:
  1141. # "The return value must be a string object", i.e. a bytestring.
  1142. # By convention, the return value of __repr__ should also be
  1143. # an ASCII string.
  1144. return self.encode(encoding)
  1145.  
  1146. def __unicode__(self):
  1147. return self.decode()
  1148.  
  1149. def __str__(self):
  1150. if PY3K:
  1151. return self.decode()
  1152. else:
  1153. return self.encode()
  1154.  
  1155. if PY3K:
  1156. __str__ = __repr__ = __unicode__
  1157.  
  1158. def encode(self, encoding=DEFAULT_OUTPUT_ENCODING,
  1159. indent_level=None, formatter="minimal",
  1160. errors="xmlcharrefreplace"):
  1161. # Turn the data structure into Unicode, then encode the
  1162. # Unicode.
  1163. u = self.decode(indent_level, encoding, formatter)
  1164. return u.encode(encoding, errors)
  1165.  
  1166. def _should_pretty_print(self, indent_level):
  1167. """Should this tag be pretty-printed?"""
  1168.  
  1169. return (
  1170. indent_level is not None
  1171. and self.name not in self.preserve_whitespace_tags
  1172. )
  1173.  
  1174. def decode(self, indent_level=None,
  1175. eventual_encoding=DEFAULT_OUTPUT_ENCODING,
  1176. formatter="minimal"):
  1177. """Returns a Unicode representation of this tag and its contents.
  1178.  
  1179. :param eventual_encoding: The tag is destined to be
  1180. encoded into this encoding. This method is _not_
  1181. responsible for performing that encoding. This information
  1182. is passed in so that it can be substituted in if the
  1183. document contains a <META> tag that mentions the document's
  1184. encoding.
  1185. """
  1186.  
  1187. # First off, turn a string formatter into a Formatter object. This
  1188. # will stop the lookup from happening over and over again.
  1189. if not isinstance(formatter, Formatter) and not callable(formatter):
  1190. formatter = self._formatter_for_name(formatter)
  1191. attrs = []
  1192. if self.attrs:
  1193. for key, val in sorted(self.attrs.items()):
  1194. if val is None:
  1195. decoded = key
  1196. else:
  1197. if isinstance(val, list) or isinstance(val, tuple):
  1198. val = ' '.join(val)
  1199. elif not isinstance(val, str):
  1200. val = str(val)
  1201. elif (
  1202. isinstance(val, AttributeValueWithCharsetSubstitution)
  1203. and eventual_encoding is not None):
  1204. val = val.encode(eventual_encoding)
  1205.  
  1206. text = self.format_string(val, formatter)
  1207. decoded = (
  1208. str(key) + '='
  1209. + EntitySubstitution.quoted_attribute_value(text))
  1210. attrs.append(decoded)
  1211. close = ''
  1212. closeTag = ''
  1213.  
  1214. prefix = ''
  1215. if self.prefix:
  1216. prefix = self.prefix + ":"
  1217.  
  1218. if self.is_empty_element:
  1219. close = ''
  1220. if isinstance(formatter, Formatter):
  1221. close = formatter.void_element_close_prefix or close
  1222. else:
  1223. closeTag = '</%s%s>' % (prefix, self.name)
  1224.  
  1225. pretty_print = self._should_pretty_print(indent_level)
  1226. space = ''
  1227. indent_space = ''
  1228. if indent_level is not None:
  1229. indent_space = (' ' * (indent_level - 1))
  1230. if pretty_print:
  1231. space = indent_space
  1232. indent_contents = indent_level + 1
  1233. else:
  1234. indent_contents = None
  1235. contents = self.decode_contents(
  1236. indent_contents, eventual_encoding, formatter)
  1237.  
  1238. if self.hidden:
  1239. # This is the 'document root' object.
  1240. s = contents
  1241. else:
  1242. s = []
  1243. attribute_string = ''
  1244. if attrs:
  1245. attribute_string = ' ' + ' '.join(attrs)
  1246. if indent_level is not None:
  1247. # Even if this particular tag is not pretty-printed,
  1248. # we should indent up to the start of the tag.
  1249. s.append(indent_space)
  1250. s.append('<%s%s%s%s>' % (
  1251. prefix, self.name, attribute_string, close))
  1252. if pretty_print:
  1253. s.append("\n")
  1254. s.append(contents)
  1255. if pretty_print and contents and contents[-1] != "\n":
  1256. s.append("\n")
  1257. if pretty_print and closeTag:
  1258. s.append(space)
  1259. s.append(closeTag)
  1260. if indent_level is not None and closeTag and self.next_sibling:
  1261. # Even if this particular tag is not pretty-printed,
  1262. # we're now done with the tag, and we should add a
  1263. # newline if appropriate.
  1264. s.append("\n")
  1265. s = ''.join(s)
  1266. return s
  1267.  
  1268. def prettify(self, encoding=None, formatter="minimal"):
  1269. if encoding is None:
  1270. return self.decode(True, formatter=formatter)
  1271. else:
  1272. return self.encode(encoding, True, formatter=formatter)
  1273.  
  1274. def decode_contents(self, indent_level=None,
  1275. eventual_encoding=DEFAULT_OUTPUT_ENCODING,
  1276. formatter="minimal"):
  1277. """Renders the contents of this tag as a Unicode string.
  1278.  
  1279. :param indent_level: Each line of the rendering will be
  1280. indented this many spaces.
  1281.  
  1282. :param eventual_encoding: The tag is destined to be
  1283. encoded into this encoding. This method is _not_
  1284. responsible for performing that encoding. This information
  1285. is passed in so that it can be substituted in if the
  1286. document contains a <META> tag that mentions the document's
  1287. encoding.
  1288.  
  1289. :param formatter: The output formatter responsible for converting
  1290. entities to Unicode characters.
  1291. """
  1292. # First off, turn a string formatter into a Formatter object. This
  1293. # will stop the lookup from happening over and over again.
  1294. if not isinstance(formatter, Formatter) and not callable(formatter):
  1295. formatter = self._formatter_for_name(formatter)
  1296.  
  1297. pretty_print = (indent_level is not None)
  1298. s = []
  1299. for c in self:
  1300. text = None
  1301. if isinstance(c, NavigableString):
  1302. text = c.output_ready(formatter)
  1303. elif isinstance(c, Tag):
  1304. s.append(c.decode(indent_level, eventual_encoding,
  1305. formatter))
  1306. if text and indent_level and not self.name == 'pre':
  1307. text = text.strip()
  1308. if text:
  1309. if pretty_print and not self.name == 'pre':
  1310. s.append(" " * (indent_level - 1))
  1311. s.append(text)
  1312. if pretty_print and not self.name == 'pre':
  1313. s.append("\n")
  1314. return ''.join(s)
  1315.  
  1316. def encode_contents(
  1317. self, indent_level=None, encoding=DEFAULT_OUTPUT_ENCODING,
  1318. formatter="minimal"):
  1319. """Renders the contents of this tag as a bytestring.
  1320.  
  1321. :param indent_level: Each line of the rendering will be
  1322. indented this many spaces.
  1323.  
  1324. :param encoding: The bytestring will be in this encoding.
  1325.  
  1326. :param formatter: The output formatter responsible for converting
  1327. entities to Unicode characters.
  1328. """
  1329.  
  1330. contents = self.decode_contents(indent_level, encoding, formatter)
  1331. return contents.encode(encoding)
  1332.  
  1333. # Old method for BS3 compatibility
  1334. def renderContents(self, encoding=DEFAULT_OUTPUT_ENCODING,
  1335. prettyPrint=False, indentLevel=0):
  1336. if not prettyPrint:
  1337. indentLevel = None
  1338. return self.encode_contents(
  1339. indent_level=indentLevel, encoding=encoding)
  1340.  
  1341. #Soup methods
  1342.  
  1343. def find(self, name=None, attrs={}, recursive=True, text=None,
  1344. **kwargs):
  1345. """Return only the first child of this Tag matching the given
  1346. criteria."""
  1347. r = None
  1348. l = self.find_all(name, attrs, recursive, text, 1, **kwargs)
  1349. if l:
  1350. r = l[0]
  1351. return r
  1352. findChild = find
  1353.  
  1354. def find_all(self, name=None, attrs={}, recursive=True, text=None,
  1355. limit=None, **kwargs):
  1356. """Extracts a list of Tag objects that match the given
  1357. criteria. You can specify the name of the Tag and any
  1358. attributes you want the Tag to have.
  1359.  
  1360. The value of a key-value pair in the 'attrs' map can be a
  1361. string, a list of strings, a regular expression object, or a
  1362. callable that takes a string and returns whether or not the
  1363. string matches for some custom definition of 'matches'. The
  1364. same is true of the tag name."""
  1365.  
  1366. generator = self.descendants
  1367. if not recursive:
  1368. generator = self.children
  1369. return self._find_all(name, attrs, text, limit, generator, **kwargs)
  1370. findAll = find_all # BS3
  1371. findChildren = find_all # BS2
  1372.  
  1373. #Generator methods
  1374. @property
  1375. def children(self):
  1376. # return iter() to make the purpose of the method clear
  1377. return iter(self.contents) # XXX This seems to be untested.
  1378.  
  1379. @property
  1380. def descendants(self):
  1381. if not len(self.contents):
  1382. return
  1383. stopNode = self._last_descendant().next_element
  1384. current = self.contents[0]
  1385. while current is not stopNode:
  1386. yield current
  1387. current = current.next_element
  1388.  
  1389. # CSS selector code
  1390.  
  1391. _selector_combinators = ['>', '+', '~']
  1392. _select_debug = False
  1393. quoted_colon = re.compile('"[^"]*:[^"]*"')
  1394. def select_one(self, selector):
  1395. """Perform a CSS selection operation on the current element."""
  1396. value = self.select(selector, limit=1)
  1397. if value:
  1398. return value[0]
  1399. return None
  1400.  
  1401. def select(self, selector, _candidate_generator=None, limit=None):
  1402. """Perform a CSS selection operation on the current element."""
  1403.  
  1404. # Handle grouping selectors if ',' exists, ie: p,a
  1405. if ',' in selector:
  1406. context = []
  1407. selectors = [x.strip() for x in selector.split(",")]
  1408.  
  1409. # If a selector is mentioned multiple times we don't want
  1410. # to use it more than once.
  1411. used_selectors = set()
  1412.  
  1413. # We also don't want to select the same element more than once,
  1414. # if it's matched by multiple selectors.
  1415. selected_object_ids = set()
  1416. for partial_selector in selectors:
  1417. if partial_selector == '':
  1418. raise ValueError('Invalid group selection syntax: %s' % selector)
  1419. if partial_selector in used_selectors:
  1420. continue
  1421. used_selectors.add(partial_selector)
  1422. candidates = self.select(partial_selector, limit=limit)
  1423. for candidate in candidates:
  1424. # This lets us distinguish between distinct tags that
  1425. # represent the same markup.
  1426. object_id = id(candidate)
  1427. if object_id not in selected_object_ids:
  1428. context.append(candidate)
  1429. selected_object_ids.add(object_id)
  1430. if limit and len(context) >= limit:
  1431. break
  1432. return context
  1433. tokens = shlex.split(selector)
  1434. current_context = [self]
  1435.  
  1436. if tokens[-1] in self._selector_combinators:
  1437. raise ValueError(
  1438. 'Final combinator "%s" is missing an argument.' % tokens[-1])
  1439.  
  1440. if self._select_debug:
  1441. print('Running CSS selector "%s"' % selector)
  1442.  
  1443. for index, token in enumerate(tokens):
  1444. new_context = []
  1445. new_context_ids = set([])
  1446.  
  1447. if tokens[index-1] in self._selector_combinators:
  1448. # This token was consumed by the previous combinator. Skip it.
  1449. if self._select_debug:
  1450. print(' Token was consumed by the previous combinator.')
  1451. continue
  1452.  
  1453. if self._select_debug:
  1454. print(' Considering token "%s"' % token)
  1455. recursive_candidate_generator = None
  1456. tag_name = None
  1457.  
  1458. # Each operation corresponds to a checker function, a rule
  1459. # for determining whether a candidate matches the
  1460. # selector. Candidates are generated by the active
  1461. # iterator.
  1462. checker = None
  1463.  
  1464. m = self.attribselect_re.match(token)
  1465. if m is not None:
  1466. # Attribute selector
  1467. tag_name, attribute, operator, value = m.groups()
  1468. checker = self._attribute_checker(operator, attribute, value)
  1469.  
  1470. elif '#' in token:
  1471. # ID selector
  1472. tag_name, tag_id = token.split('#', 1)
  1473. def id_matches(tag):
  1474. return tag.get('id', None) == tag_id
  1475. checker = id_matches
  1476.  
  1477. elif '.' in token:
  1478. # Class selector
  1479. tag_name, klass = token.split('.', 1)
  1480. classes = set(klass.split('.'))
  1481. def classes_match(candidate):
  1482. return classes.issubset(candidate.get('class', []))
  1483. checker = classes_match
  1484.  
  1485. elif ':' in token and not self.quoted_colon.search(token):
  1486. # Pseudo-class
  1487. tag_name, pseudo = token.split(':', 1)
  1488. if tag_name == '':
  1489. raise ValueError(
  1490. "A pseudo-class must be prefixed with a tag name.")
  1491. pseudo_attributes = re.match(r'([a-zA-Z\d-]+)\(([a-zA-Z\d]+)\)', pseudo)
  1492. found = []
  1493. if pseudo_attributes is None:
  1494. pseudo_type = pseudo
  1495. pseudo_value = None
  1496. else:
  1497. pseudo_type, pseudo_value = pseudo_attributes.groups()
  1498. if pseudo_type == 'nth-of-type':
  1499. try:
  1500. pseudo_value = int(pseudo_value)
  1501. except (TypeError, ValueError):
  1502. raise NotImplementedError(
  1503. 'Only numeric values are currently supported for the nth-of-type pseudo-class.')
  1504. if pseudo_value < 1:
  1505. raise ValueError(
  1506. 'nth-of-type pseudo-class value must be at least 1.')
  1507. class Counter(object):
  1508. def __init__(self, destination):
  1509. self.count = 0
  1510. self.destination = destination
  1511.  
  1512. def nth_child_of_type(self, tag):
  1513. self.count += 1
  1514. if self.count == self.destination:
  1515. return True
  1516. else:
  1517. return False
  1518. checker = Counter(pseudo_value).nth_child_of_type
  1519. else:
  1520. raise NotImplementedError(
  1521. 'Only the following pseudo-classes are implemented: nth-of-type.')
  1522.  
  1523. elif token == '*':
  1524. # Star selector -- matches everything
  1525. pass
  1526. elif token == '>':
  1527. # Run the next token as a CSS selector against the
  1528. # direct children of each tag in the current context.
  1529. recursive_candidate_generator = lambda tag: tag.children
  1530. elif token == '~':
  1531. # Run the next token as a CSS selector against the
  1532. # siblings of each tag in the current context.
  1533. recursive_candidate_generator = lambda tag: tag.next_siblings
  1534. elif token == '+':
  1535. # For each tag in the current context, run the next
  1536. # token as a CSS selector against the tag's next
  1537. # sibling that's a tag.
  1538. def next_tag_sibling(tag):
  1539. yield tag.find_next_sibling(True)
  1540. recursive_candidate_generator = next_tag_sibling
  1541.  
  1542. elif self.tag_name_re.match(token):
  1543. # Just a tag name.
  1544. tag_name = token
  1545. else:
  1546. raise ValueError(
  1547. 'Unsupported or invalid CSS selector: "%s"' % token)
  1548. if recursive_candidate_generator:
  1549. # This happens when the selector looks like "> foo".
  1550. #
  1551. # The generator calls select() recursively on every
  1552. # member of the current context, passing in a different
  1553. # candidate generator and a different selector.
  1554. #
  1555. # In the case of "> foo", the candidate generator is
  1556. # one that yields a tag's direct children (">"), and
  1557. # the selector is "foo".
  1558. next_token = tokens[index+1]
  1559. def recursive_select(tag):
  1560. if self._select_debug:
  1561. print(' Calling select("%s") recursively on %s %s' % (next_token, tag.name, tag.attrs))
  1562. print('-' * 40)
  1563. for i in tag.select(next_token, recursive_candidate_generator):
  1564. if self._select_debug:
  1565. print('(Recursive select picked up candidate %s %s)' % (i.name, i.attrs))
  1566. yield i
  1567. if self._select_debug:
  1568. print('-' * 40)
  1569. _use_candidate_generator = recursive_select
  1570. elif _candidate_generator is None:
  1571. # By default, a tag's candidates are all of its
  1572. # children. If tag_name is defined, only yield tags
  1573. # with that name.
  1574. if self._select_debug:
  1575. if tag_name:
  1576. check = tag_name
  1577. else:
  1578. check = "[any]"
  1579. print(' Default candidate generator, tag name="%s"' % check)
  1580. if self._select_debug:
  1581. # This is redundant with later code, but it stops
  1582. # a bunch of bogus tags from cluttering up the
  1583. # debug log.
  1584. def default_candidate_generator(tag):
  1585. for child in tag.descendants:
  1586. if not isinstance(child, Tag):
  1587. continue
  1588. if tag_name and not child.name == tag_name:
  1589. continue
  1590. yield child
  1591. _use_candidate_generator = default_candidate_generator
  1592. else:
  1593. _use_candidate_generator = lambda tag: tag.descendants
  1594. else:
  1595. _use_candidate_generator = _candidate_generator
  1596.  
  1597. count = 0
  1598. for tag in current_context:
  1599. if self._select_debug:
  1600. print(" Running candidate generator on %s %s" % (
  1601. tag.name, repr(tag.attrs)))
  1602. for candidate in _use_candidate_generator(tag):
  1603. if not isinstance(candidate, Tag):
  1604. continue
  1605. if tag_name and candidate.name != tag_name:
  1606. continue
  1607. if checker is not None:
  1608. try:
  1609. result = checker(candidate)
  1610. except StopIteration:
  1611. # The checker has decided we should no longer
  1612. # run the generator.
  1613. break
  1614. if checker is None or result:
  1615. if self._select_debug:
  1616. print(" SUCCESS %s %s" % (candidate.name, repr(candidate.attrs)))
  1617. if id(candidate) not in new_context_ids:
  1618. # If a tag matches a selector more than once,
  1619. # don't include it in the context more than once.
  1620. new_context.append(candidate)
  1621. new_context_ids.add(id(candidate))
  1622. elif self._select_debug:
  1623. print(" FAILURE %s %s" % (candidate.name, repr(candidate.attrs)))
  1624.  
  1625. current_context = new_context
  1626. if limit and len(current_context) >= limit:
  1627. current_context = current_context[:limit]
  1628.  
  1629. if self._select_debug:
  1630. print("Final verdict:")
  1631. for i in current_context:
  1632. print(" %s %s" % (i.name, i.attrs))
  1633. return current_context
  1634.  
  1635. # Old names for backwards compatibility
  1636. def childGenerator(self):
  1637. return self.children
  1638.  
  1639. def recursiveChildGenerator(self):
  1640. return self.descendants
  1641.  
  1642. def has_key(self, key):
  1643. """This was kind of misleading because has_key() (attributes)
  1644. was different from __contains__ (contents). has_key() is gone in
  1645. Python 3, anyway."""
  1646. warnings.warn('has_key is deprecated. Use has_attr("%s") instead.' % (
  1647. key))
  1648. return self.has_attr(key)
  1649.  
  1650. # Next, a couple classes to represent queries and their results.
  1651. class SoupStrainer(object):
  1652. """Encapsulates a number of ways of matching a markup element (tag or
  1653. text)."""
  1654.  
  1655. def __init__(self, name=None, attrs={}, text=None, **kwargs):
  1656. self.name = self._normalize_search_value(name)
  1657. if not isinstance(attrs, dict):
  1658. # Treat a non-dict value for attrs as a search for the 'class'
  1659. # attribute.
  1660. kwargs['class'] = attrs
  1661. attrs = None
  1662.  
  1663. if 'class_' in kwargs:
  1664. # Treat class_="foo" as a search for the 'class'
  1665. # attribute, overriding any non-dict value for attrs.
  1666. kwargs['class'] = kwargs['class_']
  1667. del kwargs['class_']
  1668.  
  1669. if kwargs:
  1670. if attrs:
  1671. attrs = attrs.copy()
  1672. attrs.update(kwargs)
  1673. else:
  1674. attrs = kwargs
  1675. normalized_attrs = {}
  1676. for key, value in list(attrs.items()):
  1677. normalized_attrs[key] = self._normalize_search_value(value)
  1678.  
  1679. self.attrs = normalized_attrs
  1680. self.text = self._normalize_search_value(text)
  1681.  
  1682. def _normalize_search_value(self, value):
  1683. # Leave it alone if it's a Unicode string, a callable, a
  1684. # regular expression, a boolean, or None.
  1685. if (isinstance(value, str) or callable(value) or hasattr(value, 'match')
  1686. or isinstance(value, bool) or value is None):
  1687. return value
  1688.  
  1689. # If it's a bytestring, convert it to Unicode, treating it as UTF-8.
  1690. if isinstance(value, bytes):
  1691. return value.decode("utf8")
  1692.  
  1693. # If it's listlike, convert it into a list of strings.
  1694. if hasattr(value, '__iter__'):
  1695. new_value = []
  1696. for v in value:
  1697. if (hasattr(v, '__iter__') and not isinstance(v, bytes)
  1698. and not isinstance(v, str)):
  1699. # This is almost certainly the user's mistake. In the
  1700. # interests of avoiding infinite loops, we'll let
  1701. # it through as-is rather than doing a recursive call.
  1702. new_value.append(v)
  1703. else:
  1704. new_value.append(self._normalize_search_value(v))
  1705. return new_value
  1706.  
  1707. # Otherwise, convert it into a Unicode string.
  1708. # (The Python 2 source used unicode(str(value)) here; under
  1709. # Python 3 a single str() call does the same thing.)
  1710. return str(value)
  1711.  
  1712. def __str__(self):
  1713. if self.text:
  1714. return self.text
  1715. else:
  1716. return "%s|%s" % (self.name, self.attrs)
  1717.  
  1718. def search_tag(self, markup_name=None, markup_attrs={}):
  1719. found = None
  1720. markup = None
  1721. if isinstance(markup_name, Tag):
  1722. markup = markup_name
  1723. markup_attrs = markup
  1724. call_function_with_tag_data = (
  1725. isinstance(self.name, Callable)
  1726. and not isinstance(markup_name, Tag))
  1727.  
  1728. if ((not self.name)
  1729. or call_function_with_tag_data
  1730. or (markup and self._matches(markup, self.name))
  1731. or (not markup and self._matches(markup_name, self.name))):
  1732. if call_function_with_tag_data:
  1733. match = self.name(markup_name, markup_attrs)
  1734. else:
  1735. match = True
  1736. markup_attr_map = None
  1737. for attr, match_against in list(self.attrs.items()):
  1738. if not markup_attr_map:
  1739. if hasattr(markup_attrs, 'get'):
  1740. markup_attr_map = markup_attrs
  1741. else:
  1742. markup_attr_map = {}
  1743. for k, v in markup_attrs:
  1744. markup_attr_map[k] = v
  1745. attr_value = markup_attr_map.get(attr)
  1746. if not self._matches(attr_value, match_against):
  1747. match = False
  1748. break
  1749. if match:
  1750. if markup:
  1751. found = markup
  1752. else:
  1753. found = markup_name
  1754. if found and self.text and not self._matches(found.string, self.text):
  1755. found = None
  1756. return found
  1757. searchTag = search_tag
  1758.  
  1759. def search(self, markup):
  1760. # print 'looking for %s in %s' % (self, markup)
  1761. found = None
  1762. # If given a list of items, scan it for a text element that
  1763. # matches.
  1764. if hasattr(markup, '__iter__') and not isinstance(markup, (Tag, str)):
  1765. for element in markup:
  1766. if isinstance(element, NavigableString) \
  1767. and self.search(element):
  1768. found = element
  1769. break
  1770. # If it's a Tag, make sure its name or attributes match.
  1771. # Don't bother with Tags if we're searching for text.
  1772. elif isinstance(markup, Tag):
  1773. if not self.text or self.name or self.attrs:
  1774. found = self.search_tag(markup)
  1775. # If it's text, make sure the text matches.
  1776. elif isinstance(markup, NavigableString) or \
  1777. isinstance(markup, str):
  1778. if not self.name and not self.attrs and self._matches(markup, self.text):
  1779. found = markup
  1780. else:
  1781. raise Exception(
  1782. "I don't know how to match against a %s" % markup.__class__)
  1783. return found
  1784.  
  1785. def _matches(self, markup, match_against, already_tried=None):
  1786. # print u"Matching %s against %s" % (markup, match_against)
  1787. result = False
  1788. if isinstance(markup, list) or isinstance(markup, tuple):
  1789. # This should only happen when searching a multi-valued attribute
  1790. # like 'class'.
  1791. for item in markup:
  1792. if self._matches(item, match_against):
  1793. return True
  1794. # We didn't match any particular value of the multivalue
  1795. # attribute, but maybe we match the attribute value when
  1796. # considered as a string.
  1797. if self._matches(' '.join(markup), match_against):
  1798. return True
  1799. return False
  1800.  
  1801. if match_against is True:
  1802. # True matches any non-None value.
  1803. return markup is not None
  1804.  
  1805. if isinstance(match_against, Callable):
  1806. return match_against(markup)
  1807.  
  1808. # Custom callables take the tag as an argument, but all
  1809. # other ways of matching match the tag name as a string.
  1810. original_markup = markup
  1811. if isinstance(markup, Tag):
  1812. markup = markup.name
  1813.  
  1814. # Ensure that `markup` is either a Unicode string, or None.
  1815. markup = self._normalize_search_value(markup)
  1816.  
  1817. if markup is None:
  1818. # None matches None, False, an empty string, an empty list, and so on.
  1819. return not match_against
  1820.  
  1821. if (hasattr(match_against, '__iter__')
  1822. and not isinstance(match_against, str)):
  1823. # We're asked to match against an iterable of items.
  1824. # The markup must match at least one item in the
  1825. # iterable. We'll try each one in turn.
  1826. #
  1827. # To avoid infinite recursion we need to keep track of
  1828. # items we've already seen.
  1829. if not already_tried:
  1830. already_tried = set()
  1831. for item in match_against:
  1832. if item.__hash__:
  1833. key = item
  1834. else:
  1835. key = id(item)
  1836. if key in already_tried:
  1837. continue
  1838. else:
  1839. already_tried.add(key)
  1840. if self._matches(original_markup, item, already_tried):
  1841. return True
  1842. else:
  1843. return False
  1844.  
  1845. # Beyond this point we might need to run the test twice: once against
  1846. # the tag's name and once against its prefixed name.
  1847. match = False
  1848.  
  1849. if not match and isinstance(match_against, str):
  1850. # Exact string match
  1851. match = markup == match_against
  1852.  
  1853. if not match and hasattr(match_against, 'search'):
  1854. # Regexp match
  1855. return match_against.search(markup)
  1856.  
  1857. if (not match
  1858. and isinstance(original_markup, Tag)
  1859. and original_markup.prefix):
  1860. # Try the whole thing again with the prefixed tag name.
  1861. return self._matches(
  1862. original_markup.prefix + ':' + original_markup.name, match_against
  1863. )
  1864.  
  1865. return match
  1866.  
  1867. class ResultSet(list):
  1868. """A ResultSet is just a list that keeps track of the SoupStrainer
  1869. that created it."""
  1870. def __init__(self, source, result=()):
  1871. super(ResultSet, self).__init__(result)
  1872. self.source = source
  1873.  
  1874. def __getattr__(self, key):
  1875. raise AttributeError(
  1876. "ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
  1877. )
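
The rule dispatch in `SoupStrainer._matches` is easier to follow in isolation. The sketch below is a simplified, standalone imitation, not the library's code: `matches` is a hypothetical helper, and it assumes the value is already a string, a list of strings, or `None`. It leaves out the value normalization, the `already_tried` bookkeeping, and the retry against the prefixed tag name.

```python
import re


def matches(value, rule):
    """Simplified imitation of the rule dispatch in SoupStrainer._matches."""
    # A multi-valued attribute such as 'class' matches if any single value
    # matches, or if the space-joined string matches as a whole.
    if isinstance(value, (list, tuple)):
        return (any(matches(v, rule) for v in value)
                or matches(" ".join(value), rule))
    # True matches any value that is present at all.
    if rule is True:
        return value is not None
    # A callable decides for itself.
    if callable(rule):
        return bool(rule(value))
    # A missing value only matches an "empty" rule (None, False, '', []).
    if value is None:
        return not rule
    # Exact string comparison.
    if isinstance(rule, str):
        return value == rule
    # Compiled regular expression: search, not full match.
    if hasattr(rule, "search"):
        return rule.search(value) is not None
    # Any other iterable: the value must match at least one item.
    if hasattr(rule, "__iter__"):
        return any(matches(value, item) for item in rule)
    return False


print(matches(["btn", "primary"], "primary"))      # True
print(matches("header", re.compile(r"^head")))     # True
print(matches(None, True))                         # False
```

The order of the checks matters: the list case comes first so that a callable or regular expression is tried against each individual value of a multi-valued attribute before the joined string.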

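
`Tag.__eq__` compares tags by value (name, attributes, contents), which is why the grouped-selector branch of `select()` tracks `id(candidate)` rather than the candidates themselves. A minimal sketch of the idea, using a hypothetical `FakeTag` class and a hypothetical `merge_unique` helper in place of the real classes, and ignoring the `limit` argument:

```python
class FakeTag:
    """Stand-in for Tag: equal when name and attrs are equal."""

    def __init__(self, name, attrs=None):
        self.name = name
        self.attrs = attrs or {}

    def __eq__(self, other):
        return (isinstance(other, FakeTag)
                and self.name == other.name
                and self.attrs == other.attrs)


def merge_unique(groups):
    """Merge result lists, dropping repeats of the same object only."""
    merged = []
    seen_ids = set()
    for group in groups:
        for item in group:
            # id() distinguishes distinct objects that compare equal.
            if id(item) not in seen_ids:
                merged.append(item)
                seen_ids.add(id(item))
    return merged


first = FakeTag("p", {"class": ["note"]})
second = FakeTag("p", {"class": ["note"]})   # same markup, different object

result = merge_unique([[first, second], [first]])
print(first == second)   # True
print(len(result))       # 2
```

Deduplicating with `==` would collapse `first` and `second` into one result, even though they sit at different places in the tree.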