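Topics covered in this post:

1. Using htmlTreeParse on potentially malformed pages
2. How startElement is used
3. Closures
4. How handler functions are named and how their bodies are usually written
5. Discarding nodes; extracting nodes, tag names, attributes, attribute values, and content
6. Modifying node attributes in the tree, counting nodes, storing nodes
7. Anonymous function syntax
8. xmlHashTree, xmlRoot, and the trim parameter (this item is still unresolved)
9. Encoding
10. try and tryCatch, terminating on errors
11. XInclude

The chapter in the original book is mainly about HTML, but I want to focus on section 2.4 (parsing), expand on it, and add my own understanding.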
==========================================================================
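I. A few brief excerpts on HTML

1. In Chrome (which works better than the 360 Extreme browser, even though both use the same engine), select a line of text, right-click, and choose Inspect; the corresponding line of HTML source is highlighted.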

 
2. Attributes are always placed within the start tag right after the tag name. A tag can hold multiple attributes that are simply separated by a space character. Attributes are expressed as name–value pairs, as in name="value". The value can either be enclosed by single or double quotation marks. However, if the attribute value itself contains one type of quotation mark, the other type has to be used to enclose the value.

3. Spaces and line breaks in HTML source code do not translate directly into spaces and line breaks in the browser presentation. While line breaks are ignored altogether, any number of consecutive spaces are presented as a single space.

The HTML file used in this chapter is the one read from the URL in the next section.

II. Using R
Reading the page:
    > url <- "http://www.r-datacollection.com/materials/html/fortunes.html"
    > fortunes <- readLines(con = url)
    > fortunes
     [1] "<!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML//EN\">"
     [2] "<html> <head>"
     [3] "<title>Collected R wisdoms</title>"
     [4] "</head>"
     [5] ""
     [6] "<body>"
     [7] "<div id=\"R Inventor\" lang=\"english\" date=\"June/2003\">"
     [8] " <h1>Robert Gentleman</h1>"
     [9] " <p><i>'What we have is nice, but we need something very different'</i></p>"
    [10] " <p><b>Source: </b>Statistical Computing 2003, Reisensburg"
    [11] "</div>"
    [12] ""
    [13] "<div lang=english date=\"October/2011\">"
    [14] " <h1>Rolf Turner</h1>"
    [15] " <p><i>'R is wonderful, but it cannot work magic'</i> <br><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>"
    [16] " <p><b>Source: </b><a href=\"https://stat.ethz.ch/mailman/listinfo/r-help\">R-help</a></p>"
    [17] "</div>"
    [18] ""
    [19] "<address><a href=\"www.r-datacollectionbook.com\"><i>The book homepage</i><a/></address>"
    [20] ""
    [21] "</body> </html>"
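The output above has two problems:
Problem 1: some attribute values are not quoted (element [13], lang=english).
Problem 2: the closing tag </p> of the second paragraph is missing (element [10]).
Here readLines maps each line of the input file to one element of a character vector; that is, each element of the vector fortunes is one line of HTML code.

Because of these problems, we parse the page as follows: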
 
    > library(XML)
    > parsed_fortunes <- htmlParse(file = url)
    > print(parsed_fortunes)
    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
    <html>
    <head><title>Collected R wisdoms</title></head>
    <body>
    <div id="R Inventor" lang="english" date="June/2003">
    <h1>Robert Gentleman</h1>
    <p><i>'What we have is nice, but we need something very different'</i></p>
    <p><b>Source: </b>Statistical Computing 2003, Reisensburg
    </p>
    </div>
    <div lang="english" date="October/2011">
    <h1>Rolf Turner</h1>
    <p><i>'R is wonderful, but it cannot work magic'</i><br><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
    <p><b>Source: </b><a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a></p>
    </div>
    <address>
    <a href="www.r-datacollectionbook.com"><i>The book homepage</i></a><a></a>
    </address>
    </body>
    </html>
     
     
Now the document is well-formed: the attribute values are quoted and the missing </p> has been added.
Here is the description from the help page:
Parses an XML or HTML file or string containing XML/HTML content, and generates an R structure representing the XML/HTML tree. Use htmlTreeParse when the content is known to be (potentially malformed) HTML. This function has numerous parameters/options and operates quite differently based on their values. It can create trees in R or using internal C-level nodes, both of which are useful in different contexts. It can perform conversion of the nodes into R objects using caller-specified handler functions and this can be used to map the XML document directly into R data structures, by-passing the conversion to an R-level tree which would then be processed recursively or with multiple descents to extract the information of interest.

When scraping a page we do not need all of the data, and we want the code to run fast, so we use handler functions to extract only the nodes of interest while the tree is being built.
The source code of htmlTreeParse is covered in detail in a separate note.
First, let's look at handler functions.
Example 1
    # Code snippet 3
    h1 <- list("body" = function(x){NULL})
    parsed_fortunes <- htmlTreeParse(url, handlers = h1, asTree = TRUE)
    parsed_fortunes$children
    # $html
    # <html>
    # <head>
    # <title>Collected R wisdoms</title>
    # </head>
    # </html>
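Writing h1 <- list(body = function(x){NULL}) is equivalent.
That is, the handler returns NULL whenever a body tag is encountered, which effectively deletes the body element (including its children).
The XML node is passed in as x.
Note that an R function returns the value of its last evaluated expression, so there is no need to write return(xxx).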
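To try the "return NULL to drop a node" idea without network access, here is a minimal sketch that parses an inline string instead of the book's URL. The HTML text and the names html_text and drop_body are made up for illustration; the calls themselves are the same ones used in Example 1.

```r
library(XML)

# Hypothetical inline page, so the sketch needs no network access.
html_text <- "<html><head><title>Demo</title></head><body><p>dropped</p></body></html>"

# A handler named after a tag is called for that tag;
# returning NULL discards the node together with its children.
drop_body <- list(body = function(x) NULL)

doc <- htmlTreeParse(html_text, asText = TRUE, handlers = drop_body, asTree = TRUE)

# Names of the elements that are still children of <html>.
remaining <- names(xmlChildren(doc$children$html))
print(remaining)
```

If the handler returned x instead of NULL, body would be expected to stay in the tree.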
 
Example 2
    h2 <- list(
      startElement = function(node, ...){
        name <- xmlName(node)
        if(name %in% c("div", "title")){NULL}else{node}
      },
      comment = function(node){NULL}
    )
    parsed_fortunes <- htmlTreeParse(file = url, handlers = h2, asTree = TRUE)
    parsed_fortunes$children
    $html
    <html>
    <head/>
    <body>
    <address>
    <a href="www.r-datacollectionbook.com">
    <i>The book homepage</i>
    </a>
    <a/>
    </address>
    </body>
    </html>
     
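Notes:
?'%in%'
%in% is a more intuitive interface as a binary operator, which returns a logical vector indicating if there is a match or not for its left operand.
%in% is built on match (covered in another note): it answers "is there a match?" with TRUE or FALSE instead of returning positions.
The XML node is passed in as the first argument; node is merely the name of the formal parameter, just as x was in Example 1.
The handlers delete the div and title nodes and delete all comments.
xmlName returns the name of the current node. Related functions are:
xmlChildren: the child node(s) of the current node
xmlAttrs: the attributes of the current node (returned as a character vector of name-value pairs)
xmlValue: the value of the current node (the content wrapped by the tag)
(found via the See Also section of the help page)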
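A small sketch of these accessor functions on a hand-built node, so the results can be inspected outside of any handler. The node is constructed with xmlNode purely for illustration (it mimics the R-help link in fortunes.html); xmlGetAttr is the attribute getter that appears later in these notes.

```r
library(XML)

# A hand-built node standing in for <a href="...">R-help</a>.
node <- xmlNode("a", "R-help",
                attrs = c(href = "https://stat.ethz.ch/mailman/listinfo/r-help"))

tag_name <- xmlName(node)             # the tag name
attrs    <- xmlAttrs(node)            # named character vector of all attributes
href     <- xmlGetAttr(node, "href")  # a single attribute value, by name
content  <- xmlValue(node)            # the text wrapped by the tag
n_kids   <- length(xmlChildren(node)) # here: one text child

# %in% gives TRUE/FALSE, match gives a position (or NA).
is_wanted <- tag_name %in% c("div", "a")
position  <- match(tag_name, c("div", "a"))
```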
 
Now a variant.
Example 1.1
    h1 <- list("body" = function(x){
      print('here is a body tag')
      NULL
    })
    parsed_fortunes <- htmlTreeParse(url, handlers = h1, asTree = TRUE)
    [1] "here is a body tag"
As shown, the body handler is called when a tag named body is encountered.

Example 2.1
# Testing how startElement is used
 
    i <- 0
    h2 <- list(
      startElement = function(node, ...){
        i <<- i + 1
        print(paste("here is the ", i, "st tag,its name is", xmlName(node)))
        NULL
      }

      # comment = function(node){
      #   print(paste("here is a comment,its name is", xmlName(node)))
      #   NULL
      # }
    )
    parsed_fortunes <- htmlTreeParse(file = url, handlers = h2, asTree = TRUE)

    [1] "here is the  1 st tag,its name is title"
    [1] "here is the  2 st tag,its name is head"
    [1] "here is the  3 st tag,its name is h1"
    [1] "here is the  4 st tag,its name is i"
    [1] "here is the  5 st tag,its name is p"
    [1] "here is the  6 st tag,its name is b"
    [1] "here is the  7 st tag,its name is p"
    [1] "here is the  8 st tag,its name is div"
    [1] "here is the  9 st tag,its name is h1"
    [1] "here is the  10 st tag,its name is i"
    [1] "here is the  11 st tag,its name is br"
    [1] "here is the  12 st tag,its name is emph"
    [1] "here is the  13 st tag,its name is p"
    [1] "here is the  14 st tag,its name is b"
    [1] "here is the  15 st tag,its name is a"
    [1] "here is the  16 st tag,its name is p"
    [1] "here is the  17 st tag,its name is div"
    [1] "here is the  18 st tag,its name is i"
    [1] "here is the  19 st tag,its name is a"
    [1] "here is the  20 st tag,its name is a"
    [1] "here is the  21 st tag,its name is address"
    [1] "here is the  22 st tag,its name is body"
    [1] "here is the  23 st tag,its name is html"
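As we can see, startElement is called for every tag that is encountered.
Moreover, a tag is handed to the handler as a node only once its closing tag has been reached: title is reported before head, and children always come before their parents.
(The superassignment operator <<- is used here to get around variable scoping.)
As for how the handler actually gets invoked, that is presumably done in the package's C/C++ code; in Java the equivalent would be iterating over a collection.

Example 3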
 
    getItalics = function(){
      i_container = character()
      list(i = function(node, ...){
        i_container <<- c(i_container, xmlValue(node))
      }, returnI = function() i_container)
    }
    h3 <- getItalics()
    invisible(htmlTreeParse(url, handlers = h3))
    h3$returnI()
    [1] "'What we have is nice, but we need something very different'"
    [2] "'R is wonderful, but it cannot work magic'"
    [3] "The book homepage"
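The book mentions that a closure is used here. For closures, see The Art of R Programming (《R语言编程艺术》), p. 147.
The R Language Definition is another reference.
Honestly, I think the book's account and the official one differ.
The topic also comes up on R-help, and what is said there agrees with the official definition.
The two articles in the series billed as "esoteric R" are also worth a look.

 Er, I have not actually finished reading them myself; I will come back to them once I have finished the book.
Closures as seen from JavaScript can serve as a reference too.

The key point is how the functions are named.
This handler extracts the values of the i tags.
What does i_container <<- c(i_container, xmlValue(node)) mean?
Let's look at an example: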
 
    > a = character()
    > b <- c(a, '2')
    > b
    [1] "2"
    > c <- c(b, '3')
    > c
    [1] "2" "3"
    > a
    character(0)
    > b
    [1] "2"
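In other words, i_container <<- c(i_container, xmlValue(node)) combines i_container with the node's value: c() appends to what is already there. The variable name stays the same, and this seems more convenient than assigning to vector elements by index.
invisible suppresses printing of the result.
Doesn't the second list component,
, returnI = function() i_container)
look rather magical too? It is simply a function stored in the list that reads the same i_container the i handler writes to.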
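The closure pattern behind getItalics can be reduced to plain base R with no parsing involved. This is only a sketch of the idea; the names make_collector, add, and get are made up and are not part of the XML package.

```r
make_collector <- function() {
  items <- character()            # private state, lives in this call's environment
  list(
    add = function(value) {
      items <<- c(items, value)   # <<- updates `items` in the enclosing environment
      invisible(NULL)
    },
    get = function() items        # reads the very same `items`
  )
}

collector <- make_collector()
collector$add("first")
collector$add("second")
collected <- collector$get()

# Every call to make_collector() gets its own fresh `items`.
fresh <- make_collector()$get()
```

In getItalics, the i handler plays the role of add and returnI plays the role of get.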
 
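In fact, even downloading the package's reference manual as a PDF only makes searching easier: its bookmarks show which functions exist, but they do not help much with understanding them in depth.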
===================================================================
III. Supplementary excerpts from the help documentation
The handlers argument is used similarly to those specified in xmlEventParse. When an XML tag (element) is processed, we look for a function in this collection with the same name as the tag's name. If this is not found, we look for one named startElement. If this is not found, we use the default built in converter. The same works for comments, entity references, cdata, processing instructions, etc. The default entries should be named comment, startElement, externalEntity, processingInstruction, text, cdata and namespace. All but the last should take the XMLnode as their first argument. In the future, other information may be passed via ..., for example, the depth in the tree, etc. Specifically, the second argument will be the parent node into which they are being added, but this is not currently implemented, so should have a default value (NULL).
When a tag is processed, the parser looks in the handler collection (1) for a function with the same name as the tag, then (2) for startElement, and only then falls back to the default converter.
The same applies when a comment or similar item is processed: for a comment, it first looks for a function called comment.
So the names of the handler functions (the component names of the list) have to follow these rules.
The first argument of each of these functions must accept the node (this is what "take" means here).
This passage is very good: it makes clear how handler functions should be written.
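The lookup order can be observed directly with a tag-specific handler and a startElement handler side by side. This is a minimal sketch on a made-up inline page; the variable seen and the HTML text are assumptions for illustration only.

```r
library(XML)

# Record which handler saw which tag.
seen <- list(p = character(), fallback = character())

handlers <- list(
  # Tag-specific handler: looked up first, by tag name.
  p = function(node, ...) {
    seen$p <<- c(seen$p, xmlName(node))
    node
  },
  # Generic handler: used only for tags without a handler of their own.
  startElement = function(node, ...) {
    seen$fallback <<- c(seen$fallback, xmlName(node))
    node
  }
)

html_text <- "<html><body><p>one</p><p>two</p><div>three</div></body></html>"
invisible(htmlTreeParse(html_text, asText = TRUE, handlers = handlers))

print(seen)
```

According to the help text quoted above, the p tags should only reach the p handler, while div and the other tags go to startElement.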

 
Here is another passage from the documentation, describing the return value:
children
A list of the XML nodes at the top of the document. Each of these is of class XMLNode. These are made up of 4 fields.
name The name of the element.
attributes For regular elements, a named list of XML attributes converted from the <tag x="1" y="abc">
children List of sub-nodes.
value Used only for text entries.
Some nodes specializations of XMLNode, such as XMLComment, XMLProcessingInstruction, XMLEntityRef are used.
If the value of the argument getDTD is TRUE and the document refers to a DTD via a top-level DOCTYPE element, the DTD and its information will be available in the dtd field. The second element is a list containing the external and internal DTDs. Each of these contains 2 lists - one for element definitions and another for entities. See parseDTD.
If a list of functions is given via handlers, this list is returned. Typically, these handler functions share state via a closure and the resulting updated data structures which contain the extracted and processed values from the XML document can be retrieved via a function in this handler list.
If asTree is TRUE, then the converted tree is returned. What form this takes depends on what the handler functions have done to process the XML tree.
 
 
 
An interesting Q&A: How to write trycatch in R
 
========================================================================
IV. Examples
First, the contents of test.xml:
    <?xml version="1.0"?>
    <!DOCTYPE foo [
    <!ENTITY % bar "for R and S">
    <!ENTITY % foo "for Omegahat">
    <!ENTITY testEnt "test entity bar">
    <!ENTITY logo SYSTEM "images/logo.gif" NDATA gif>
    <!ENTITY % extEnt SYSTEM "http://www.omegahat.net"><!-- include the contents of the README file in the same directory as this one. -->
    <!ELEMENT x (#PCDATA) >
    <!ELEMENT y (x)* >
    ]>
    <!-- A comment -->
    <foo x="1">
    <element attrib1="my value"/>
    &testEnt;
    <?R sum(rnorm(100))?>
    <a>
    <!-- A comment -->
    <b>
    %extEnt;
    </b>
    </a>
    <![CDATA[
    This is escaped data
    containing < and &.
    ]]>
    Note that this caused a segmentation fault if replaceEntities was
    not TRUE.
    That is,
    <code>
    xmlTreeParse("test.xml", replaceEntities = TRUE)
    </code>
    works, but
    <code>
    xmlTreeParse("test.xml")
    </code>
    does not if this is called before the one above.
    This is now fixed and was caused by
    treating an xmlNodePtr in the C code
    that had type XML_ELEMENT_DECL
    and so was in fact an xmlElementPtr.
    Aaah, C and casting!
    </foo>

 
 
    fileName <- system.file("exampleData", "test.xml", package = "XML")
    # parse the document and return it in its standard format.
    xmlTreeParse(fileName)
    # parse the document, discarding comments.
    xmlTreeParse(fileName, handlers = list("comment" = function(x, ...){NULL}), asTree = TRUE)

Nothing much to say about this one.


 
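    # print the entities
    invisible(xmlTreeParse(fileName,
                           handlers = list(entity = function(x){
                             cat("In entity", x$name, x$value, "\n")
                             x}
                           ), asTree = TRUE
              )
    )
What is an entity? I searched my notes and found that I never got to this when teaching myself XML; hopefully later chapters of the book will cover it. (In test.xml, the entities are the names declared with <!ENTITY ...> in the DTD and referenced in the body, such as &testEnt;.)
Getting the name and value this way is very direct, so I tried it out: in the handlers above, replacing xmlName(node) with node$name also works.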

 
 
 
    # Parse some XML text.
    # Read the text from the file
    xmlText <- paste(readLines(fileName), "\n", collapse = "")
    print(xmlText)
    xmlTreeParse(xmlText, asText = TRUE)
    # with version 1.4.2 we can pass the contents of an XML
    # stream without pasting them.
    xmlTreeParse(readLines(fileName), asText = TRUE)
Nothing much to say about this one either.

 
 
    # Read a MathML document and convert each node
    # so that the primary class is
    #   <name of tag>MathML
    # so that we can use method dispatching when processing
    # it rather than conditional statements on the tag name.
    # See plotMathML() in examples/.
    fileName <- system.file("exampleData", "mathml.xml", package = "XML")
    m <- xmlTreeParse(fileName,
                      handlers = list(
                        startElement = function(node){
                          cname <- paste(xmlName(node), "MathML", sep = "", collapse = "")
                          class(node) <- c(cname, class(node))
                          node
                        }))

This one is interesting: it modifies the class attribute of the node, putting "<tag name>MathML" first and the original classes after it. When we later call a generic function on the node, R dispatches on that first class and picks the matching method automatically, so we no longer need if branches to select a method by tag name.
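The dispatch mechanism this relies on is ordinary S3, so it can be sketched without any XML at all. Everything below (the describe generic, the DemoNode class, the tag_class helper) is hypothetical and only mirrors what the startElement handler does to each node.

```r
# A generic and two methods: one specific to <mfrac>, one catch-all.
describe <- function(node, ...) UseMethod("describe")
describe.default     <- function(node, ...) paste("generic node:", node$name)
describe.mfracMathML <- function(node, ...) paste("fraction node:", node$name)

# Same trick as in the handler: prepend "<tag name>MathML" to the class vector.
tag_class <- function(node) {
  class(node) <- c(paste0(node$name, "MathML"), class(node))
  node
}

frac  <- tag_class(structure(list(name = "mfrac"), class = "DemoNode"))
other <- tag_class(structure(list(name = "mi"), class = "DemoNode"))

res_frac  <- describe(frac)   # dispatches to describe.mfracMathML
res_other <- describe(other)  # no describe.miMathML exists, falls back to default
```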


 
 
 
    # In this example, we extract _just_ the names of the
    # variables in the mtcars.xml file.
    # The names are the contents of the <variable> tags.
    # We discard all other tags by returning NULL
    # from the startElement handler.
    #
    # We cumulate the names of variables in a character vector named `vars'.
    # We define this within a closure and define the
    # variable function within that closure so that it
    # will be invoked when the parser encounters a <variable> tag.
    # This is called with 2 arguments: the XMLNode object (containing its children) and
    # the list of attributes.
    # We get the variable name via call to xmlValue().
    # Note that we define the closure function in the call and then
    # create an instance of it by calling it directly as
    #   (function() {...})()
    # Note that we can get the names by parsing
    # in the usual manner and the entire document and then executing
    # xmlSApply(xmlRoot(doc)[[1]], function(x) xmlValue(x[[1]]))
    # which is simpler but is more costly in terms of memory.
    fileName <- system.file("exampleData", "mtcars.xml", package = "XML")
    doc <- xmlTreeParse(fileName,
                        handlers = (function(){
                                      vars <- character(0)
                                      list(
                                        variable = function(x, attrs){
                                          vars <<- c(vars, xmlValue(x[[1]]))
                                          print(vars)
                                        },
                                        startElement = function(x, attr){
                                          NULL
                                        },
                                        names = function(){
                                          vars
                                        }
                                      )
                                    })()
    )
    [1] "mpg"
    [1] "mpg" "cyl"
    [1] "mpg"  "cyl"  "disp"
    [1] "mpg"  "cyl"  "disp" "hp"
    [1] "mpg"  "cyl"  "disp" "hp"   "drat"
    [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"
    [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec"
    [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"
    [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"
     [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
     [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"
The handlers read mtcars.xml, extract the values of the variable nodes, and discard all other nodes.
Note this way of writing an anonymous function that is called on the spot:
handlers = (function(){})()
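Stripped of the parsing, the (function(){...})() form just builds the handler list and its private state in one expression, without leaving a named constructor behind. A minimal base R sketch; the handlers here take a plain string instead of a node, which is an assumption made to keep the example self-contained.

```r
# Define an anonymous function and call it immediately; only its result is kept.
handlers <- (function() {
  vars <- character(0)   # private to this one call
  list(
    variable = function(value) {
      vars <<- c(vars, value)
      invisible(NULL)
    },
    names = function() vars
  )
})()

handlers$variable("mpg")
handlers$variable("cyl")
collected_names <- handlers$names()
```

The trade-off compared with a named constructor such as getItalics is that a second, independent instance cannot be created without repeating the whole expression.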

 
 
 
    # Here we just print the variable names to the console
    # with a special handler.
    doc <- xmlTreeParse(fileName, handlers = list(
                                      variable = function(x, attrs){
                                        print(xmlValue(x[[1]])); TRUE
                                      }), asTree = TRUE)
I do not really see why the extras (the trailing TRUE and asTree = TRUE) are needed; written like this, the printed output is the same:

    doc <- xmlTreeParse(fileName,
                        handlers = list(variable = function(x, attrs){
                                          print(xmlValue(x[[1]]))
                                        })
                        )
    [1] "mpg"
    [1] "cyl"
    [1] "disp"
    [1] "hp"
    [1] "drat"
    [1] "wt"
    [1] "qsec"
    [1] "vs"
    [1] "am"
    [1] "gear"
    [1] "carb"

 
# This should raise an error.

    try(xmlTreeParse(
      system.file("exampleData", "TestInvalid.xml", package = "XML"),
      validate = TRUE))

However, no error occurred when I ran it.

 
## Not run: 
 
    # Parse an XML document directly from a URL.
    # Requires Internet access.
    xmlTreeParse("http://www.omegahat.net/Scripts/Data/mtcars.xml", asText = TRUE)

    Error: XML content does not seem to be XML:'http://www.omegahat.net/Scripts/Data/mtcars.xml'
The asText = TRUE argument is the culprit: it makes the function treat the URL string itself as the XML content.

 
 
 
    counter = function(){
      counts = integer(0)
      list(startElement = function(node){
        name = xmlName(node)
        if(name %in% names(counts))
          counts[name] <<- counts[name] + 1
        else
          counts[name] <<- 1
      },
      counts = function() counts)
    }
    h = counter()
    invisible(xmlParse(system.file("exampleData", "mtcars.xml", package = "XML"),
                       handlers = h)
              )

    h$counts()
     variable variables    record   dataset
           22         2        64         2
This handler counts how many times each kind of node occurs.
The counts[name] form is interesting: it indexes the vector by element name rather than by position.
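The counts[name] idiom works on any named vector, so it can be tried in isolation. A small base R sketch; count_names and the sample tag names are made up for illustration.

```r
count_names <- function(tag_names) {
  counts <- integer(0)
  for (name in tag_names) {
    if (name %in% names(counts)) {
      counts[name] <- counts[name] + 1L  # existing name: increment in place
    } else {
      counts[name] <- 1L                 # new name: appends a named element
    }
  }
  counts
}

tag_counts <- count_names(c("dataset", "variables", "variable", "variable", "record"))
print(tag_counts)
```

Inside a handler the same assignments need <<- instead of <-, because the vector lives in the enclosing environment rather than in the handler itself.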

 
 
 
    getLinks = function(){
      links = character()
      list(a = function(node, ...){
        links <<- c(links, xmlGetAttr(node, "href"))
        node
      },
      links = function() links)
    }
    h1 = getLinks()
    invisible(htmlTreeParse(system.file("examples", "index.html", package = "XML"),
                            handlers = h1))
    h1$links()
     [1] "XML_0.97-0.tar.gz"
     [2] "XML_0.97-0.zip"
     [3] "XML_0.97-0.tar.gz"
     [4] "XML_0.97-0.zip"
     [5] "Overview.html"
     [6] "manual.pdf"
     [7] "Tour.pdf"
     [8] "description.pdf"
     [9] "WritingXML.html"
    [10] "FAQ.html"
    [11] "Changes"
    [12] "http://cm.bell-labs.com/stat/duncan"
    [13] "mailto:duncan@wald.ucdavis.edu"
This collects the values of the href attribute of every a tag.
Continuing from the above:

    h2 = getLinks()
    htmlTreeParse(system.file("examples", "index.html", package = "XML"),
                  handlers = h2, useInternalNodes = TRUE)
    all(h1$links() == h2$links())
    [1] TRUE

 
 
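    # Using flat trees
    tt = xmlHashTree()
    f = system.file("exampleData", "mtcars.xml", package = "XML")
    xmlTreeParse(f, handlers = list(.startElement = tt[[".addNode"]]))
    #### this printed the handler function itself; adding asTree = TRUE did not seem to change that
    tt                  # a command I added myself
    <variable/>
    xmlRoot(tt)
    <variable/>
tt[[".addNode"]] fetches the .addNode function stored in the xmlHashTree object; running that expression on its own prints the function.
.addNode literally means "add a node", but how does it add one?
First, the help for xmlHashTree:
These (and related internal) functions allow us to represent trees as a simple, non-hierarchical collection of nodes along with corresponding tables that identify the parent and child relationships.
That is, these functions let us represent a tree as a simple, non-hierarchical collection of nodes, with tables that identify the parent and child relationships.
(Judging by the function name, the tree seems to be stored in a hash structure.) Reading further:
The function .addNode is used to insert a new node into the tree.
Next, the help for xmlRoot:
xmlRoot(x, skip = TRUE, ...)
These are a collection of methods for providing easy access to the top-level XMLNode object resulting from parsing an XML document
x
the object whose root/top-level XML node is to be returned.
In other words, it returns the root or top-level node of the object passed in. We can check the class of tt:

    > class(tt)
    [1] "XMLHashTree"         "XMLAbstractDocument"
But how exactly does .addNode add nodes? This code appears to do nothing, and yet the root node has ended up inside the tt object.
I did not figure this out at first.
My tentative reading was this: tt is a global variable equal to xmlHashTree(), that is, tt = xmlHashTree().
Then tt[[".addNode"]] is the .addNode function of that object, and .startElement is called for every node, so tt would in effect hold each return value in turn; the last return value of .addNode is <variable/>, so tt is <variable/>.
But why would the return value of .addNode become the value returned by xmlHashTree()?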
 
    function (nodes = list(), parents = character(), children = list(),
        env = new.env(TRUE, parent = emptyenv()))
    {
        .count = 0
        env$.children = .children = new.env(TRUE)
        env$.parents = .parents = new.env(TRUE)
        f = function(suggestion = "") {
            if (suggestion == "" || exists(suggestion, env, inherits = FALSE))
                as.character(.count + 1)
            else suggestion
        }
        assign(".nodeIdGenerator", f, env)
        addNode = function(node, parent = character(), ..., attrs = NULL,
            namespace = NULL, namespaceDefinitions = character(),
            .children = list(...), cdata = FALSE, suppressNamespaceWarning = getOption("suppressXMLNamespaceWarning",
                FALSE)) {
            if (is.character(node))
                node = xmlNode(node, attrs = attrs, namespace = namespace,
                    namespaceDefinitions = namespaceDefinitions)
            .kids = .children
            .children = .this$.children
            node = asXMLTreeNode(node, .this, className = "XMLHashTreeNode")
            id = node$id
            assign(id, node, env)
            .count <<- .count + 1
            if (!inherits(parent, "XMLNode") && (!is.environment(parent) &&
                length(parent) == 0) || parent == "")
                return(node)
            if (inherits(parent, "XMLHashTreeNode"))
                parent = parent$id
            if (length(parent)) {
                assign(id, parent, envir = .parents)
                if (exists(parent, .children, inherits = FALSE))
                    tmp = c(get(parent, .children), id)
                else tmp = id
                assign(parent, tmp, .children)
            }
            return(node)
        }
        env$.addNode <- addNode
        .tidy = function() {
            idx <- idx - 1
            length(nodeSet) <- idx
            length(nodeNames) <- idx
            names(nodeSet) <- nodeNames
            .nodes <<- nodeSet
            idx
        }
        .this = structure(env, class = oldClass("XMLHashTree"))
        .this
    }
I think I was on the right track, but the explanation still does not read smoothly; in the end it has to do with the variable's environment. Reading the source above, addNode calls assign(id, node, env), and env is the very environment that xmlHashTree() returns as tt, so every handler call stores its node inside tt as a side effect; tt itself is never reassigned.
I will set the remaining details aside for now.

 
 
 
    f = system.file("exampleData", "mtcars.xml", package = "XML")
    doc = xmlTreeParse(f, useInternalNodes = TRUE)
    sapply(getNodeSet(doc, "//variable"), xmlValue)
     [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"
    [10] "gear" "carb"
This simply selects the variable elements from doc (getNodeSet returns the nodes) and then passes each node to xmlValue to get its value.
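The same getNodeSet plus xmlValue combination can be tried on an inline document. A minimal sketch; the XML text is made up and only imitates the shape of mtcars.xml.

```r
library(XML)

# Hypothetical miniature of mtcars.xml.
xml_text <- "<dataset><variables><variable>mpg</variable><variable>cyl</variable></variables></dataset>"
doc <- xmlParse(xml_text, asText = TRUE)

# "//variable" selects every <variable> element anywhere in the document.
var_nodes <- getNodeSet(doc, "//variable")
var_names <- sapply(var_nodes, xmlValue)

# A more specific path: the first <variable> under /dataset/variables.
first_name <- xmlValue(getNodeSet(doc, "/dataset/variables/variable[1]")[[1]])
```

Compared with the handler approach, this builds the whole tree first, which is simpler to write but uses more memory, as the comments in the mtcars example point out.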

 
 
    # character set encoding for HTML
    f = system.file("exampleData", "9003.html", package = "XML")
    # we specify the encoding
    d = htmlTreeParse(f, encoding = "UTF-8")
    # get a different result if we do not specify any encoding
    d.no = htmlTreeParse(f)
    # document with its encoding in the HEAD of the document.
    d.self = htmlTreeParse(system.file("exampleData", "9003-en.html", package = "XML"))
    # XXX want to do a test here to see the similarities between d and
    # d.self and differences between d.no

On encoding and decoding

The file nodes1.xml:
    <x xmlns:xinclude="http://www.w3.org/2001/XInclude">
    <!-- Simple test of including a set of nodes from an XML document -->
    <xinclude:include href="something.xml#xpointer(//p)"/>
    </x>
The file nodes2.xml:

    <x xmlns:xinclude="http://www.w3.org/2001/XInclude">
       <!-- Simple test of including a set of nodes from an XML document -->
       <xinclude:include href="doesnt_exist.xml#xpointer(//p)">
        <xinclude:fallback>
    Some <i>fallback text</i></xinclude:fallback>
       </xinclude:include>
    </x>

 
 
 
    # include
    f = system.file("exampleData", "nodes1.xml", package = "XML")
    xmlRoot(xmlTreeParse(f, xinclude = FALSE))
    <x xmlns:xinclude="http://www.w3.org/2001/XInclude">
     <!--Simple test of including a set of nodes from an XML document-->
     <xinclude:include href="something.xml#xpointer(//p)"/>
    </x>
    xmlRoot(xmlTreeParse(f, xinclude = TRUE))
    <x xmlns:xinclude="http://www.w3.org/2001/XInclude">
     <!--Simple test of including a set of nodes from an XML document-->
     <p ID="author">something</p>
     <p>really</p>
     <p>simple</p>
    </x>
    f = system.file("exampleData", "nodes2.xml", package = "XML")
    xmlRoot(xmlTreeParse(f, xinclude = TRUE))
    failed to load external entity "D:/RSets/R-3.3.2/library/XML/exampleData/doesnt_exist.xml"
    <x xmlns:xinclude="http://www.w3.org/2001/XInclude">
     <!--Simple test of including a set of nodes from an XML document-->
     Some
     <i>fallback text</i>
    </x>
xinclude
a logical value indicating whether to process nodes of the form <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"> to insert content from other parts of (potentially different) documents. TRUE means resolve the external references; FALSE means leave the node as is. Of course, one can process these nodes oneself after document has been parse using handler functions or working on the DOM. Please note that the syntax for inclusion using XPointer is not the same as XPath and the results can be a little unexpected and confusing. See the libxml2 documentation for more details.
Let's see what "resolve" means here.
<xinclude:include href="something.xml#xpointer(//p)"/>
refers to the p elements in something.xml, and something.xml looks like this:

    <doc>
    <p ID="author">something</p>

    <p>really</p>

    <foo>bar</foo>

    <p>simple</p>
    </doc>
In the nodes2.xml example,
<xinclude:include href="doesnt_exist.xml#xpointer(//p)"><xinclude:fallback>
Some <i>fallback text</i></xinclude:fallback></xinclude:include>
the "failed to load external entity" message appears because there is no file called doesnt_exist.xml; the second tag, xinclude:fallback, has no href, and its content is what gets inserted instead.

 
# Errors
 
 
    try(xmlTreeParse("<doc><a> & < <?pi ></doc>"))
    xmlParseEntityRef: no name
    StartTag: invalid element name
    ParsePI: PI pi never end ...
    Premature end of data in tag a line 1
    Premature end of data in tag doc line 1
    Error : 1: xmlParseEntityRef: no name
    2: StartTag: invalid element name
    3: ParsePI: PI pi never end ...
    4: Premature end of data in tag a line 1
    5: Premature end of data in tag doc line 1

 
# catch the error by type.

    tryCatch(xmlTreeParse("<doc><a> & < <?pi > </doc>"),
             "XMLParserErrorList" = function(e){
               cat("Errors in XML document\n", e$message, "\n")
             }
    )
    xmlParseEntityRef: no name
    StartTag: invalid element name
    ParsePI: PI pi never end ...
    Premature end of data in tag a line 1
    Premature end of data in tag doc line 1
    Errors in XML document
    1: xmlParseEntityRef: no name
    2: StartTag: invalid element name
    3: ParsePI: PI pi never end ...
    4: Premature end of data in tag a line 1
    5: Premature end of data in tag doc line 1
The handler named "XMLParserErrorList" catches conditions of that class. Its argument e is the condition object that was raised, and e$message holds the error messages, that is, items 1 to 5 above.
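Catching "by type" is a general tryCatch feature: the handler's name is matched against the classes of the condition object. A base R sketch with a stand-in condition class; parse_stub and DemoParserErrorList are made up and only imitate what the parser raises.

```r
# A stand-in for a parser that fails with a classed condition.
parse_stub <- function(text) {
  cond <- structure(
    list(message = "1: bad entity\n2: unterminated tag", call = NULL),
    class = c("DemoParserErrorList", "error", "condition")
  )
  stop(cond)
}

# The handler whose name matches one of the condition's classes is chosen;
# handlers are tried in the order they are written.
by_class <- tryCatch(
  parse_stub("<doc>"),
  DemoParserErrorList = function(e) paste("caught by class:", class(e)[1]),
  error = function(e) "caught by generic error handler"
)

# An ordinary error does not have the special class, so it falls through
# to the generic error handler.
generic <- tryCatch(
  stop("plain failure"),
  DemoParserErrorList = function(e) "caught by class",
  error = function(e) paste("generic:", conditionMessage(e))
)
```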

 
#  terminate on first error            
 
    try(xmlTreeParse("<doc><a> & < <?pi > </doc>", error = NULL))
    Error: xmlParseEntityRef: no name

 
    f = system.file("exampleData", "book.xml", package = "XML")
    doc.trim = xmlInternalTreeParse(f, trim = TRUE)
    doc = xmlInternalTreeParse(f, trim = FALSE)
    xmlSApply(xmlRoot(doc.trim), class)
         chapter                  chapter
    [1,] "XMLInternalElementNode" "XMLInternalElementNode"
    [2,] "XMLInternalNode"        "XMLInternalNode"
    [3,] "XMLAbstractNode"        "XMLAbstractNode"
    xmlSApply(xmlRoot(doc), class)
         text                  chapter
    [1,] "XMLInternalTextNode" "XMLInternalElementNode"
    [2,] "XMLInternalNode"     "XMLInternalNode"
    [3,] "XMLAbstractNode"     "XMLAbstractNode"
         text                  chapter
    [1,] "XMLInternalTextNode" "XMLInternalElementNode"
    [2,] "XMLInternalNode"     "XMLInternalNode"
    [3,] "XMLAbstractNode"     "XMLAbstractNode"
         text
    [1,] "XMLInternalTextNode"
    [2,] "XMLInternalNode"
    [3,] "XMLAbstractNode"

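Oddly, xmlInternalTreeParse is listed on the same help page but has no explanation of its own.

The trim parameter
whether to strip white space from the beginning and end of text strings.
That is, whether whitespace at the start and end of text strings is removed.
The printed output of doc matches book.xml as written, whereas the printed output of doc.trim looks as if the originally compact content had been spread out with whitespace.

 Could the trim parameter be implemented backwards?
I did not understand this at the time. The class listing above does show the expected effect, though: with trim = FALSE the text nodes between the chapter elements are kept, and with trim = TRUE they are gone. The extra spacing when doc.trim is printed is presumably indentation added by the printing step, not whitespace stored in the tree.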

 
# Storing nodes
 
 
    f = system.file("exampleData", "book.xml", package = "XML")
    titles = list()
    xmlTreeParse(f, handlers = list(title = function(x)
      titles[[length(titles) + 1]] <<- x))

    $title    # this is the output
    function (x)
    titles[[length(titles) + 1]] <<- x
    sapply(titles, xmlValue)
    [1] "XML"
    [2] "The elements of an XML document"
    [3] "Parsing XML"
    [4] "DOM"
    [5] "SAX"
    [6] "XSL"
    [7] "templates"
    [8] "XPath expressions"
    [9] "named templates"
    rm(titles)
This form is interesting: it appends x as a new element at the end of the list titles.
titles[[length(titles) + 1]] <<- x
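The append idiom can be checked with plain strings in place of XML nodes. A minimal base R sketch; the store function and the sample values are made up for illustration.

```r
titles <- list()

# Index one past the current end, so each call grows the list by one element.
# <<- is needed because `titles` lives outside the function.
store <- function(x) titles[[length(titles) + 1]] <<- x

store("XML")
store("Parsing XML")

n_titles <- length(titles)
last_title <- titles[[n_titles]]
```

Using [[ ]] rather than c() keeps each stored object intact as a single list element, which matters when the elements are whole nodes rather than strings.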

 
 
 
 
 
