Beautifulsoup4

KindEditor

1 Official site: http://kindeditor.net/doc.php

2 Directory layout:

├── asp                      ASP examples
├── asp.net                  ASP.NET examples
├── attached                 empty folder for uploaded (attached) files
├── examples                 HTML examples
├── jsp                      Java (JSP) examples
├── kindeditor-all-min.js    all JS (minified)
├── kindeditor-all.js        all JS (unminified)
├── kindeditor-min.js        KindEditor JS only (minified)
├── kindeditor.js            KindEditor JS only (unminified)
├── lang                     language packs
├── license.txt              license
├── php                      PHP examples
├── plugins                  plugins used internally by KindEditor
└── themes                   KindEditor themes

3 Basic usage

<script src="/static/kindeditor-4.1.10/kindeditor-all.js"></script>
<script>
    KindEditor.create('#i1',{
        width:'300px',
        height:'200px',
        items:['source','indent','bold','image','link'],
        filterMode:true,
        htmlTags:{ span : ['.color', '.background-color' ]},
        resizeType:2,
        themeType:'default',
        designMode:false,
        noDisableItems:['source','bold'],
        {# Custom upload field name, upload URL, and extra parameters #}
        filePostName:'fafafa',
        uploadJson:'/upload_img.html',
        extraFileUploadParams:{
            'csrfmiddlewaretoken':'{{ csrf_token }}'
        }
    })
</script>

4 Full option reference

http://kindeditor.net/doc3.php?cmd=config

5 Comment box example

<div class="commentarea2">
    <h4>Post a comment</h4>
    <form novalidate>
        Nickname:
        <input type="text" value="{{ dict.username }}" class="hide i1">
        <input type="text" value="" class="hide i2">
        <textarea id="content"></textarea>
        <input id='i3' type="submit" value="Submit comment">
        <a href="/exit/" class="hide a1">Log out</a>
        <a href="/login/" class="hide a2">Log in</a>
    </form>
</div>

<script src="/static/kindeditor-4.1.10/kindeditor-all.js"></script>
<script>
    // Show the login link for anonymous users, the nickname and logout link otherwise
    $(function(){
        if($('.i1').val()=='None'){
            $('.i2').removeClass('hide');
            $('.a2').removeClass('hide');
        }else{
            $('.i1').removeClass('hide');
            $('.a1').removeClass('hide');
        }
    });

    KindEditor.create('#content',{
        width:'50%',
        height:'50px',
        resizeType:0,
        items:['source','indent','bold','image','link'],
        filePostName:'fafafa',
        uploadJson:'/upload_img.html',
        extraFileUploadParams:{'csrfmiddlewaretoken':'{{ csrf_token }}'},
        // Copy the editor content back into the textarea when the editor loses focus
        afterBlur: function(){this.sync();}
    });

    $('#i3').click(function(){
        var comment=$('#content').val();
        alert(comment);
        var article_id=$('#article_id').val();
        $.ajax({
            url:'/add_comment.html',
            type:'post',
            data:{'username':'{{ dict.username }}','article_id':article_id,'comment':comment,'csrfmiddlewaretoken':'{{ csrf_token }}'},
            dataType:'JSON',
            success:function (data) {
                alert(data);
                location.reload();
            }
        })
    })
</script>

import json
import os

from django.http import HttpResponse


def upload_img(request):
    upload_type = request.GET.get('dir')     # type of the uploaded file, as sent by the editor
    file_obj = request.FILES.get('fafafa')   # key must match filePostName on the front end
    file_path = os.path.join('static/img', file_obj.name)
    with open(file_path, 'wb') as f:
        for chunk in file_obj.chunks():
            f.write(chunk)
    # Return data in the shape the editor understands (the URL where the image was saved)
    dic = {
        'error': 0,
        'url': '/' + file_path,
        'message': 'Something went wrong...'
    }

    return HttpResponse(json.dumps(dic))

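When submitting an article comment, prefer a plain form submission: the page refreshes automatically and the comment thread is updated.
Submitting via ajax requires extra KindEditor configuration, and the ajax success callback must refresh the page itself, for example with location.reload().

When KindEditor decorates a textarea:
with a form submission, the form automatically picks up the textarea value from KindEditor;
when the data is submitted with jQuery, however, you need to add KindEditor.create('',{ afterBlur: function(){this.sync();} })
so that a function runs when the editor loses focus and syncs the editor's content back into the textarea.

Use cases: creating a new post, comments.
How file upload works internally: the plugin generates an iframe element and submits the image through it as a "pseudo-ajax" request.
Front end: <script src="/static/kindeditor-4.1.10/kindeditor-all.js"></script>
KindEditor.create('#i1',{
filePostName:'fafafa',   field name of the uploaded file
uploadJson:'/upload_img.html',   URL the file is uploaded to
extraFileUploadParams:{   extra parameters sent with the upload (the pseudo-ajax request carries the CSRF token)
'csrfmiddlewaretoken':'{{ csrf_token }}'
}
})

Back end:
request.GET.get('dir')   type of the uploaded file
dic={   data in the shape KindEditor understands (enables the preview)
'error':0,
'url':'/'+filepath,
'message':'error'
}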
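The back-end notes above can be sketched without Django as two small helpers. Both function names are hypothetical. The success shape (error 0 plus url) comes from the view above; the failure shape (error 1 plus message) is an assumption based on KindEditor's bundled server examples. Stripping the directory part of the client-supplied file name is an extra precaution that the original view does not take.

```python
import json
import os


def safe_upload_path(upload_dir, filename):
    """Join upload_dir with only the final component of a client-supplied name."""
    # basename drops any directory part such as "../../" sent by the client.
    name = os.path.basename(filename.replace("\\", "/"))
    if not name:
        raise ValueError("empty file name")
    return os.path.join(upload_dir, name)


def kindeditor_response(url=None, message=None):
    """Build the JSON body: error=0 with a url, or error=1 with a message."""
    if url is not None:
        return json.dumps({"error": 0, "url": url})
    return json.dumps({"error": 1, "message": message or "upload failed"})


path = safe_upload_path("static/img", "../../etc/passwd")
ok_body = kindeditor_response(url="/" + path.replace(os.sep, "/"))
fail_body = kindeditor_response(message="file type not allowed")
print(ok_body)
print(fail_body)
```

In a Django view the returned string would be passed to HttpResponse, as in upload_img above.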

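7 Filtering special tags to prevent XSS

BeautifulSoup is a module that takes an HTML or XML string and parses it into a tree. You can then use the methods it provides to look up specific elements quickly, which makes searching HTML or XML simple.

Install the dependency (the "lxml" parser used below also needs the separate lxml package):

`pip3 install beautifulsoup4`

Usage example:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
...
</body>
</html>
"""

soup = BeautifulSoup(html_doc, features="lxml")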
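Since the body of html_doc is elided, here is a small self-contained sketch of parsing and a first lookup. The sample markup is hypothetical and only serves the illustration; it uses the built-in "html.parser" so that no extra parser package is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
sample = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</body></html>
"""

# "html.parser" ships with Python; "lxml" needs the separate lxml package.
soup = BeautifulSoup(sample, "html.parser")

title_text = soup.title.string                     # text of the <title> tag
first_link = soup.find("a")                        # first <a> in document order
link_ids = [a["id"] for a in soup.find_all("a")]   # ids of every <a>

print(title_text, first_link["href"], link_ids)
```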

1. name: the tag name

# tag = soup.find('a')
# name = tag.name      # get
# print(name)
# tag.name = 'span'    # set
# print(soup)

2. attrs: the tag attributes

# tag = soup.find('a')
# attrs = tag.attrs            # get
# print(attrs)
# tag.attrs = {'ik':123}       # set (replaces all attributes)
# tag.attrs['id'] = 'iiiii'    # set a single attribute
# print(soup)
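A runnable sketch of items 1 and 2, on a hypothetical one-line document. Note that the class attribute comes back as a list, because Beautiful Soup treats it as multi-valued.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a id="link1" class="sister" href="/elsie">Elsie</a>', "html.parser")
tag = soup.find("a")

old_name = tag.name            # read the tag name
old_attrs = dict(tag.attrs)    # copy of the attribute dict before changes

tag.name = "span"              # rename the tag in place
tag["id"] = "renamed"          # same as tag.attrs["id"] = ...
del tag["href"]                # remove one attribute

print(old_name, old_attrs)
print(soup)
```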

3. children: all direct child nodes

# body = soup.find('body')
# v = body.children

4. descendants: all nested nodes, at every depth

# body = soup.find('body')
# v = body.descendants
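The difference between the two iterators is easiest to see side by side. A minimal sketch on hypothetical markup; the isinstance check skips text nodes, which both iterators also yield.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<body><div><p>one</p></div><span>two</span></body>", "html.parser")
body = soup.find("body")

# children: only the direct children of <body>
child_names = [c.name for c in body.children if isinstance(c, Tag)]
# descendants: every nested node, in document order
descendant_names = [d.name for d in body.descendants if isinstance(d, Tag)]

print(child_names)        # ['div', 'span']
print(descendant_names)   # ['div', 'p', 'span']
```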

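5. clear: remove everything inside the tag (the tag itself is kept)

# tag = soup.find('body')
# tag.clear()
# print(soup)

6. decompose: remove the tag together with everything inside it

# body = soup.find('body')
# body.decompose()
# print(soup)

7. extract: remove the tag together with everything inside it, and return the removed tag

# body = soup.find('body')
# v = body.extract()
# print(soup)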
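The three removal methods differ in what stays behind and what you get back. A small comparison on the same hypothetical fragment, rebuilt for each call (the helper fresh is only for this sketch):

```python
from bs4 import BeautifulSoup


def fresh():
    # Build a new tree for each experiment so the calls do not affect each other.
    return BeautifulSoup("<div><p id='x'>text <b>bold</b></p></div>", "html.parser")


soup = fresh()
soup.find("p").clear()              # empties <p>, keeps the tag and its attributes
after_clear = str(soup)             # <div><p id="x"></p></div>

soup = fresh()
soup.find("p").decompose()          # destroys <p> and its contents
after_decompose = str(soup)         # <div></div>

soup = fresh()
removed = soup.find("p").extract()  # detaches <p> and returns it, still usable
after_extract = str(soup)           # <div></div>
removed_text = removed.get_text()   # 'text bold'

print(after_clear, after_decompose, after_extract, removed_text)
```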

8. decode: convert to a string (including the current tag); decode_contents (excluding the current tag)

# body = soup.find('body')
# v = body.decode()
# v = body.decode_contents()
# print(v)

9. encode: convert to bytes (including the current tag); encode_contents (excluding the current tag)

# body = soup.find('body')
# v = body.encode()
# v = body.encode_contents()
# print(v)
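A quick sketch showing the four variants on a hypothetical fragment: decode returns str, encode returns bytes (UTF-8 by default), and the _contents forms leave out the tag itself.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><p>hi</p></body>", "html.parser")
body = soup.find("body")

with_tag = body.decode()               # '<body><p>hi</p></body>'
inner = body.decode_contents()         # '<p>hi</p>'
as_bytes = body.encode()               # b'<body><p>hi</p></body>'
inner_bytes = body.encode_contents()   # b'<p>hi</p>'

print(with_tag, inner, as_bytes, inner_bytes)
```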

10. find: return the first matching tag

# tag = soup.find('a')
# print(tag)
# tag = soup.find(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
# tag = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
# print(tag)

11. find_all: return all matching tags

# tags = soup.find_all('a')
# print(tags)
# tags = soup.find_all('a',limit=1)
# print(tags)
# tags = soup.find_all(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
# # tags = soup.find_all(name='a', class_='sister', recursive=True, text='Lacie')
# print(tags)

# ####### lists: match any of the given values #######
# v = soup.find_all(name=['a','div'])
# print(v)
# v = soup.find_all(class_=['sister0', 'sister'])
# print(v)
# v = soup.find_all(text=['Tillie'])
# print(v, type(v[0]))
# v = soup.find_all(id=['link1','link2'])
# print(v)
# v = soup.find_all(href=['link1','link2'])
# print(v)

# ####### regular expressions #######
import re
# rep = re.compile('p')
# rep = re.compile('^p')
# v = soup.find_all(name=rep)
# print(v)
# rep = re.compile('sister.*')
# v = soup.find_all(class_=rep)
# print(v)
# rep = re.compile('http://www.oldboy.com/static/.*')
# v = soup.find_all(href=rep)
# print(v)

# ####### filter functions #######
# def func(tag):
#     return tag.has_attr('class') and tag.has_attr('id')
# v = soup.find_all(name=func)
# print(v)

# ####### get: read a tag attribute #######
# tag = soup.find('a')
# v = tag.get('id')
# print(v)
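The filter styles listed for find_all can be tried together on a small hypothetical document. This sketch uses string= rather than text=; as far as I know string is the newer name for the same argument (text is the older alias), so treat that choice as an assumption about your bs4 version.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
sample = """
<p class="title" id="t">Title</p>
<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>
<a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>
<a id="link3" href="http://example.org/tillie">Tillie</a>
"""
soup = BeautifulSoup(sample, "html.parser")

# List: match any of the given values.
by_list = [t["id"] for t in soup.find_all(id=["link1", "link3"])]
# Regular expression: matched against the attribute value.
by_regex = [t["id"] for t in soup.find_all(href=re.compile(r"^http://example\.com/"))]
# Tag name plus class, stopping after the first hit.
by_class = [t["id"] for t in soup.find_all("a", class_="sister", limit=1)]


def has_class_and_id(tag):
    # Filter function: called once per tag, keep the tag when it returns True.
    return tag.has_attr("class") and tag.has_attr("id")


by_func = [t["id"] for t in soup.find_all(has_class_and_id)]
by_string = soup.find("a", string="Lacie")["id"]

print(by_list, by_regex, by_class, by_func, by_string)
```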

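12. has_attr: check whether the tag has the given attribute

# tag = soup.find('a')
# v = tag.has_attr('id')
# print(v)

13. get_text: get the text inside the tag (the optional first argument is a separator, not an attribute name)

# tag = soup.find('a')
# v = tag.get_text()
# print(v)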
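A short sketch of get_text with and without its separator and strip arguments, plus has_attr, on a hypothetical fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p id='intro'> Hello <b>big</b> world </p>", "html.parser")
p = soup.find("p")

has_id = p.has_attr("id")              # True
has_class = p.has_attr("class")        # False
raw = p.get_text()                     # all nested text, concatenated as-is
joined = p.get_text("|", strip=True)   # pieces stripped, then joined with "|"

print(has_id, has_class, repr(raw), joined)
```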

14. index: position of a child within its parent tag

# tag = soup.find('body')
# v = tag.index(tag.find('div'))
# print(v)
# tag = soup.find('body')
# for i,v in enumerate(tag):
#     print(i,v)

15. is_empty_element: whether the tag is an empty (void) or self-closing element,

i.e. whether it is one of: 'br' , 'hr', 'input', 'img', 'meta','spacer', 'link', 'frame', 'base'

# tag = soup.find('br')
# v = tag.is_empty_element
# print(v)

16. Nodes related to the current tag

# tag.next
# tag.next_element
# tag.next_elements
# tag.next_sibling
# tag.next_siblings
#
# tag.previous
# tag.previous_element
# tag.previous_elements
# tag.previous_sibling
# tag.previous_siblings
#
# tag.parent
# tag.parents

17. Searching among the nodes related to a tag

# tag.find_next(...)
# tag.find_all_next(...)
# tag.find_next_sibling(...)
# tag.find_next_siblings(...)
# tag.find_previous(...)
# tag.find_all_previous(...)
# tag.find_previous_sibling(...)
# tag.find_previous_siblings(...)
# tag.find_parent(...)
# tag.find_parents(...)
# These take the same arguments as find_all

18. select, select_one: CSS selectors

soup.select("title")
soup.select("p:nth-of-type(3)")
soup.select("body a")
soup.select("html head title")
tag = soup.select("span,a")
soup.select("head > title")
soup.select("p > a")
soup.select("p > a:nth-of-type(2)")
soup.select("p > #link1")
soup.select("body > a")
soup.select("#link1 ~ .sister")
soup.select("#link1 + .sister")
soup.select(".sister")
soup.select("[class~=sister]")
soup.select("#link1")
soup.select("a#link2")
soup.select('a[href]')
soup.select('a[href="http://example.com/elsie"]')
soup.select('a[href^="http://example.com/"]')
soup.select('a[href$="tillie"]')
soup.select('a[href*=".com/el"]')

from bs4.element import Tag

def default_candidate_generator(tag):
    # Yield only descendant tags that carry an href attribute
    for child in tag.descendants:
        if not isinstance(child, Tag):
            continue
        if not child.has_attr('href'):
            continue
        yield child

# _candidate_generator is a private argument of older bs4 releases;
# newer releases that delegate CSS selection to soupsieve may not accept it.
tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator)
print(type(tags), tags)

# The same generator combined with limit=1 returns at most one match.
tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator, limit=1)
print(type(tags), tags)
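A few of the selectors above, run against a hypothetical document so the results are visible. The a[href] selector with limit=1 gets the same effect as the custom generator (only links that carry an href, at most one result) using public arguments only.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
sample = """
<div id="box">
<p class="story">
<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>
<a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>
<a class="sister" id="link3" href="http://example.com/tillie">Tillie</a>
</p>
</div>
"""
soup = BeautifulSoup(sample, "html.parser")

children = [a["id"] for a in soup.select("p > a")]              # direct children of <p>
second = soup.select_one("p > a:nth-of-type(2)")["id"]          # second <a> inside <p>
later = [a["id"] for a in soup.select("#link1 ~ .sister")]      # all following siblings
adjacent = [a["id"] for a in soup.select("#link1 + .sister")]   # immediate next sibling
suffix = [a["id"] for a in soup.select('a[href$="tillie"]')]    # href ends with "tillie"
limited = soup.select("a[href]", limit=1)                       # stop after one match

print(children, second, later, adjacent, suffix, limited)
```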

19. Tag content

# tag = soup.find('span')
# print(tag.string)            # get
# tag.string = 'new content'   # set
# print(soup)

# tag = soup.find('body')
# print(tag.string)
# tag.string = 'xxx'
# print(soup)

# tag = soup.find('body')
# v = tag.stripped_strings     # recursively yields the text of every nested tag, stripped
# print(v)
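One detail worth seeing in action: .string is None when a tag has more than one child, while stripped_strings walks the whole subtree. A sketch on a hypothetical fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><span>one</span>\n  <span> two </span></div>", "html.parser")
span = soup.find("span")
div = soup.find("div")

single = span.string                  # 'one': the tag has exactly one text child
multi = div.string                    # None: <div> has several children
pieces = list(div.stripped_strings)   # whitespace-only strings are skipped

span.string = "new content"           # replaces the content of the first <span>
updated = list(div.stripped_strings)

print(single, multi, pieces, updated)
```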

20. append: append a tag at the end of the current tag's contents

# tag = soup.find('body')
# tag.append(soup.find('a'))
# print(soup)
#
# from bs4.element import Tag
# obj = Tag(name='i',attrs={'id': 'it'})
# obj.string = 'I am new here'
# tag = soup.find('body')
# tag.append(obj)
# print(soup)

21. insert: insert a tag at a given position inside the current tag

# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = 'I am new here'
# tag = soup.find('body')
# tag.insert(2, obj)
# print(soup)

22. insert_after, insert_before: insert after or before the current tag

# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = 'I am new here'
# tag = soup.find('body')
# # tag.insert_before(obj)
# tag.insert_after(obj)
# print(soup)

23. replace_with: replace the current tag with the given tag

# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = 'I am new here'
# tag = soup.find('div')
# tag.replace_with(obj)
# print(soup)
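The four insertion methods can be followed step by step on one hypothetical tree. This sketch creates tags with soup.new_tag instead of constructing Tag directly; that is a swapped-in choice, not what the snippets above do. The helper make is only for this example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><div>old</div><p>keep</p></body>", "html.parser")
body = soup.find("body")


def make(text):
    # Hypothetical helper: build an <i> tag holding the given text.
    tag = soup.new_tag("i")
    tag.string = text
    return tag


body.append(make("last"))                         # becomes the last child of <body>
body.insert(0, make("first"))                     # becomes the first child of <body>
soup.find("p").insert_before(make("before-p"))    # sibling placed just before <p>
soup.find("div").replace_with(make("replaced"))   # <div> is swapped out

order = [child.get_text() for child in body.children]
print(order)
```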

24. Setting up relationships between tags

# tag = soup.find('div')
# a = soup.find('a')
# tag.setup(previous_sibling=a)
# print(tag.previous_sibling)

25. wrap: wrap the current tag inside the given tag

# from bs4.element import Tag
# obj1 = Tag(name='div', attrs={'id': 'it'})
# obj1.string = 'I am new here'
#
# tag = soup.find('a')
# v = tag.wrap(obj1)
# print(soup)

# tag = soup.find('a')
# v = tag.wrap(soup.find('p'))
# print(soup)

26. unwrap: remove the current tag but keep whatever it wraps

# tag = soup.find('a')
# v = tag.unwrap()
# print(soup)
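wrap and unwrap are mirror images, which a short round trip on a hypothetical fragment makes visible:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><a href='/x'>link</a></p>", "html.parser")

a = soup.find("a")
wrapper = a.wrap(soup.new_tag("div"))   # <a> now sits inside a new <div>; the wrapper is returned
wrapped = str(soup)                     # <p><div><a href="/x">link</a></div></p>

soup.find("a").unwrap()                 # drop the <a> tag itself, keep its text
unwrapped = str(soup)                   # <p><div>link</div></p>

print(wrapped)
print(unwrapped)
```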

Filtering on the back end:

from bs4 import BeautifulSoup

def xss(content):

    # Whitelist: allowed tag name -> allowed attributes
    valid_tag={
        'p':['class','id'],
        'img':['href','alt','src'],
        'div':['class']
    }

    soup=BeautifulSoup(content,'html.parser')

    tags=soup.find_all()
    for tag in tags:
        if tag.name not in valid_tag:
            tag.decompose()
            continue    # the tag is gone; skipping avoids a KeyError in the lookup below
        if tag.attrs:
            for k in list(tag.attrs.keys()):
                if k not in valid_tag[tag.name]:
                    del tag.attrs[k]

    content_str=soup.decode()
    return content_str
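The same whitelist idea as a self-contained sketch with a sample input. The names ALLOWED and sanitize are hypothetical, and the sketch uses extract() instead of decompose() so that already-removed tags still have a usable name while the loop continues. It illustrates tag and attribute filtering only; it is not a complete XSS defense (for example, it does not inspect attribute values such as javascript: URLs).

```python
from bs4 import BeautifulSoup

# Hypothetical whitelist: allowed tag name -> allowed attribute names.
ALLOWED = {"p": {"class", "id"}, "img": {"alt", "src"}, "div": {"class"}}


def sanitize(html):
    soup = BeautifulSoup(html, "html.parser")
    # find_all(True) returns a list of every tag, so mutating the tree is safe here.
    for tag in soup.find_all(True):
        if tag.name not in ALLOWED:
            tag.extract()            # detach the tag together with its contents
            continue
        for key in list(tag.attrs):  # copy the keys: we delete while looping
            if key not in ALLOWED[tag.name]:
                del tag[key]
    return soup.decode()


dirty = '<div class="c" onclick="evil()"><p id="a" style="x">hi</p><script>alert(1)</script></div>'
clean = sanitize(dirty)
print(clean)
```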

Singleton example based on __new__:

from bs4 import BeautifulSoup


class XSSFilter(object):
    __instance = None

    def __init__(self):
        # XSS whitelist: allowed tag name -> allowed attributes
        self.valid_tags = {
            "font": ['color', 'size', 'face', 'style'],
            'b': [],
            'div': [],
            "span": [],
            "table": [
                'border', 'cellspacing', 'cellpadding'
            ],
            'th': [
                'colspan', 'rowspan'
            ],
            'td': [
                'colspan', 'rowspan'
            ],
            "a": ['href', 'target', 'name'],
            "img": ['src', 'alt', 'title'],
            'p': [
                'align'
            ],
            "pre": ['class'],
            "hr": ['class'],
            'strong': []
        }

    def __new__(cls, *args, **kwargs):
        """
        Singleton: create the instance on the first call and reuse it afterwards.
        """
        if not cls.__instance:
            # object.__new__ accepts only the class in Python 3
            obj = object.__new__(cls)
            cls.__instance = obj
        return cls.__instance

    def process(self, content):
        soup = BeautifulSoup(content, 'lxml')
        # Walk every HTML tag
        for tag in soup.find_all(recursive=True):
            # Is the tag name in the whitelist?
            if tag.name not in self.valid_tags:
                tag.hidden = True    # do not render the tag itself
                if tag.name not in ['html', 'body']:
                    tag.clear()      # and drop its contents as well
                continue
            # Attribute whitelist for the current tag
            attr_rules = self.valid_tags[tag.name]
            keys = list(tag.attrs.keys())
            for key in keys:
                if key not in attr_rules:
                    del tag[key]

        # decode() returns str; the legacy renderContents() returned bytes
        return soup.decode()


if __name__ == '__main__':
    html = """<p class="title">
                        <b>The Dormouse's story</b>
                    </p>
                    <p class="story">
                        <div name='root'>
                            Once upon a time there were three little sisters; and their names were
                            <a href="http://example.com/elsie" class="sister c1" style='color:red;background-color:green;' id="link1"><!-- Elsie --></a>
                            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
                            <a href="http://example.com/tillie" class="sister" id="link3">Tilffffffffffffflie</a>;
                            and they lived at the bottom of a well.
                            <script>alert(123)</script>
                        </div>
                    </p>
                    <p class="story">...</p>"""

    obj = XSSFilter()
    v = obj.process(html)
    print(v)
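The __new__-based singleton can be isolated in a few lines. This sketch uses a hypothetical class name and a call counter to show one consequence of the pattern: __init__ still runs on every call, so in a class like XSSFilter the whitelist dict is rebuilt each time unless the setup is guarded.

```python
class Singleton:
    _instance = None

    def __new__(cls, *args, **kwargs):
        # Create the object once; later calls return the cached instance.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self):
        # __init__ runs on every Singleton() call, so guard one-time setup.
        if not hasattr(self, "calls"):
            self.calls = 0
        self.calls += 1


first = Singleton()
second = Singleton()
print(first is second, second.calls)   # True 2
```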
