13. CSS Selectors: BeautifulSoup4
2021-03-09 15:28
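(1) Like lxml, Beautiful Soup is an HTML/XML parser. Its main job is to parse HTML/XML and extract data from it.
(2) lxml only traverses part of the document, whereas Beautiful Soup is based on the HTML DOM: it loads the whole document and parses the entire DOM tree. This costs considerably more time and memory, so it is generally slower than lxml.
(3) Beautiful Soup makes parsing HTML fairly simple and its API is very user-friendly. It supports CSS selectors and can use the HTML parser from the Python standard library as well as lxml's parsers, including the lxml XML parser.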
Installation: `pip install beautifulsoup4`
Official documentation: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/
Scraping tool | Speed | Difficulty of use | Installation difficulty |
---|---|---|---|
Regular expressions | Fastest | Hard | None (built in) |
BeautifulSoup | Slow | Easiest | Easy |
lxml | Fast | Easy | Moderate |
1. Example
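
    from bs4 import BeautifulSoup

    html = """<html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """

    # Create the Beautiful Soup object
    soup = BeautifulSoup(html, 'lxml')

    # Alternatively, create the object from a local HTML file
    # soup = BeautifulSoup(open('index.html'), 'lxml')

    # Pretty-print the contents of the soup object
    print(soup.prettify())
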
Output:

    <html>
     <head>
      <title>
       The Dormouse's story
      </title>
     </head>
     <body>
      <p class="title" name="dromouse">
       <b>
        The Dormouse's story
       </b>
      </p>
      <p class="story">
       Once upon a time there were three little sisters; and their names were
       <a class="sister" href="http://example.com/elsie" id="link1">
        <!-- Elsie -->
       </a>
       ,
       <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
       </a>
       and
       <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
       </a>
       ; and they lived at the bottom of a well.
      </p>
      <p class="story">
       ...
      </p>
     </body>
    </html>

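If no parser is specified explicitly, Beautiful Soup falls back to the best HTML parser available on the current system (lxml when it is installed). When the same code runs on another system, or in a different virtual environment, a different parser may be picked and the behavior may differ.
To select the lxml parser explicitly, use `soup = BeautifulSoup(html, 'lxml')`.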
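As a small illustration of naming the parser explicitly, the sketch below uses the standard-library `html.parser` so that it runs without lxml installed. The fragment and variable names are made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, used only for illustration.
fragment = "<p>Hello</p>"

# Name the parser explicitly so the result does not depend on
# which parsers happen to be installed on the machine.
soup = BeautifulSoup(fragment, "html.parser")

# html.parser keeps the fragment as written; lxml would instead
# wrap it in <html><body>...</body></html>.
print(soup)           # <p>Hello</p>
print(soup.p.string)  # Hello
```

Passing the parser name every time keeps results reproducible across machines and virtual environments.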
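2. The four kinds of objects
Beautiful Soup turns a complex HTML document into a complex tree structure in which every node is a Python object. All objects fall into four kinds: (1) Tag (2) NavigableString (3) BeautifulSoup (4) Comment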
2.1 Tag

    <title>The Dormouse's story</title>
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>

A Tag is one of the HTML tags (such as `title`, `head`, `a` and `p` in the code above) together with the content it contains.

    from bs4 import BeautifulSoup

    # html is the same document string as in the example in section 1
    soup = BeautifulSoup(html, 'lxml')

    print(soup.title)
    # <title>The Dormouse's story</title>

    print(soup.head)
    # <head><title>The Dormouse's story</title></head>

    print(soup.a)
    # <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

    print(type(soup.a))
    # <class 'bs4.element.Tag'>

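Writing soup followed by a tag name returns that tag and its content; the resulting object is of type bs4.element.Tag.
This kind of lookup returns only the first matching tag in the whole document.
A Tag has two important attributes of its own: name and attrs.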

    print(soup.name)
    # [document]
    # The soup object itself is special: its name is [document]

    print(soup.head.name)
    # head
    # For any other tag, the value is the tag's own name

    print(soup.p.attrs)
    # {'class': ['title'], 'name': 'dromouse'}
    # All attributes of the p tag, returned as a dictionary

    print(soup.p['class'])
    # ['title']  (the value of one attribute)

    # Equivalent, using get()
    print(soup.p.get('class'))
    # ['title']

    # Change the class attribute of the p tag
    soup.p['class'] = 'newClass'
    print(soup.p)
    # <p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>

    # An attribute can also be deleted
    del soup.p['class']
    print(soup.p)
    # <p name="dromouse"><b>The Dormouse's story</b></p>

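One practical difference between the two ways of reading an attribute is what happens when it is missing. The sketch below (sample markup and variable names are made up) suggests that `get()` is the safer choice when an attribute may be absent.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the tag has no id attribute.
soup = BeautifulSoup('<p class="title" name="dromouse">text</p>', "html.parser")
p = soup.p

# get() returns None (or a supplied default) for a missing attribute.
missing = p.get("id")
fallback = p.get("id", "no-id")

# Square-bracket access raises KeyError for a missing attribute.
try:
    p["id"]
    raised = False
except KeyError:
    raised = True

# has_attr() checks for presence without reading the value.
has_class = p.has_attr("class")

print(missing, fallback, raised, has_class)  # None no-id True True
```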
2.2 NavigableString
The text inside a tag can be obtained with `.string`.

    print(soup.p.string)
    # The Dormouse's story

    print(type(soup.p.string))
    # <class 'bs4.element.NavigableString'>

2.3 BeautifulSoup
A BeautifulSoup object represents the content of a whole document. It can be treated as a special Tag object, so its type, name and attributes can each be read.

    print(type(soup.name))
    # <class 'str'>

    print(soup.name)
    # [document]

    print(soup.attrs)
    # {}  (the document itself has no attributes)

2.4 Comment
A Comment object is a special type of NavigableString. When it is printed, the output does not include the comment delimiters.
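
    print(soup.a)
    # <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

    print(soup.a.string)
    # Elsie

    print(type(soup.a.string))
    # <class 'bs4.element.Comment'>
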
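Note the difference between Comment and NavigableString. When the `.string` of a tag is an HTML comment, the comment delimiters are dropped and the text inside is returned as a Comment object. When there is no comment, the text is returned as a NavigableString object.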
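Because a Comment prints just like ordinary text, code that only wants visible text may need to check the type. A minimal sketch, assuming a made-up helper named `visible_text` and sample markup that are not part of the original article:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup: the first link contains only a comment.
html = '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(html, "html.parser")


def visible_text(tag):
    """Return the tag's single string, or None if it is missing or a comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


results = [visible_text(a) for a in soup.find_all("a")]
print(results)  # [None, 'Lacie']
```

Comment is a subclass of NavigableString, so the `isinstance` check against Comment has to come before any check against NavigableString.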
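3. Traversing the document tree
3.1 Direct children: the `.contents` and `.children` attributes
(1) The `.contents` attribute
The `.contents` attribute of a Tag returns the tag's direct children as a list.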
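
    # .contents returns the direct children of a tag as a list
    print(soup.body.contents)
    """
    ['\n', <p class="title" name="dromouse"><b>The Dormouse's story</b></p>, '\n', <p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>, '\n', <p class="story">...</p>, '\n']
    """
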
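3.2 All descendants: the `.descendants` attribute
`.contents` and `.children` contain only the direct children of a Tag. `.descendants` walks recursively through all of a Tag's descendants. Like `.children`, it returns an iterator rather than a list.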

    print(soup.descendants)
    # a generator object

    for child in soup.descendants:
        print(child)
    """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    ... (rest of the document)
    </body></html>
    <head><title>The Dormouse's story</title></head>
    <title>The Dormouse's story</title>
    The Dormouse's story
    ... (then <body>, each <p>, <b> and <a>, and every string inside them, in document order)
    """

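The difference between direct children and all descendants is easier to see on a very small tree. The following sketch uses made-up markup and `html.parser`; it is only an illustration, not the article's sample document.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: <div> has one child <p>, which in turn has children.
html = "<div><p>one <b>two</b></p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# .children yields only the direct children of <div>.
direct = [child.name for child in div.children]

# .descendants yields every nested node, tags and strings alike,
# in document order.
nested = [n.name if isinstance(n, Tag) else str(n) for n in div.descendants]

print(direct)  # ['p']
print(nested)  # ['p', 'one ', 'b', 'two']
```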
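3.3 Node content: the `.string` attribute
If a tag has exactly one child and that child is a NavigableString, `.string` returns that child. If a tag has exactly one child that is itself a tag, `.string` can still be used, and the result is the same as the `.string` of that only child.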

    print(soup.head.string)
    # The Dormouse's story

    print(soup.title.string)
    # The Dormouse's story

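The rule above also implies a limit: when a tag has more than one child, `.string` cannot decide which one is meant and returns None. A short sketch with made-up markup, using `get_text()` as one possible alternative:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <p> has two children, a string and a <b> tag.
soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

single = soup.b.string       # <b> has exactly one string child
ambiguous = soup.p.string    # <p> has two children, so this is None
joined = soup.p.get_text()   # concatenates all nested strings

print(single)     # world
print(ambiguous)  # None
print(joined)     # Hello world
```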
4. Searching the document tree
4.1 find_all(name,attrs,recursive,text,**kwargs)
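4.1.1 The name parameter
The name parameter finds every tag whose name matches name. String objects are ignored automatically.
(1) Passing a string
When a string is passed to a search method, Beautiful Soup looks for tags whose name matches that string exactly.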
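
    print(type(soup.find_all('p')))
    # <class 'bs4.element.ResultSet'>

    print(soup.find_all('p'))
    """
    [<p class="title" name="dromouse"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>, <p class="story">...</p>]
    """
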
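(2) Passing a regular expression
If a compiled regular expression is passed, Beautiful Soup filters tag names with the pattern's `search()` method, so the pattern needs a `^` anchor to match only at the start of the name.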

    # Find all tags whose name starts with b
    import re

    for tag in soup.find_all(re.compile('^b')):
        print(tag.name)
    # body
    # b

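To show why the `^` anchor matters, the sketch below compares an anchored and an unanchored pattern. The markup is a made-up, shortened document, and `html.parser` is used so the example does not depend on lxml.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical minimal document for illustration.
html = "<html><head><title>T</title></head><body><p><b>x</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Anchored: only names that start with "b".
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]

# Unanchored: any name that contains "t" anywhere.
contains_t = [tag.name for tag in soup.find_all(re.compile("t"))]

print(starts_with_b)  # ['body', 'b']
print(contains_t)     # ['html', 'title']
```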
(3) Passing a list
If a list is passed, Beautiful Soup returns everything that matches any element of the list.
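
    print(soup.find_all(["a", 'p']))
    """
    [<p class="title" name="dromouse"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>, <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, <p class="story">...</p>]
    """
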
4.1.2 Keyword arguments

    soup.find_all(id='link2')
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

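Keyword filtering is not limited to `id`. The sketch below, with made-up markup that includes a hypothetical `data-role` attribute, shows two related forms: `class_` (with a trailing underscore, because `class` is a Python keyword) and the `attrs` dictionary for attribute names that are not valid Python identifiers.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; data-role is invented for this illustration.
html = (
    '<p class="story">'
    '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> '
    '<a class="sister" href="http://example.com/tillie" id="link3" data-role="last">Tillie</a>'
    "</p>"
)
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find_all(id="link2")                      # keyword on id
by_class = soup.find_all("a", class_="sister")         # keyword on class
by_attrs = soup.find_all(attrs={"data-role": "last"})  # names with a hyphen

print(len(by_id), len(by_class), len(by_attrs))  # 1 2 1
```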
4.1.3 The text parameter
The text parameter searches the string content of the document. Like the name parameter, it accepts a string, a regular expression or a list.

    import re

    print(soup.find_all(text='Tillie'))
    print(soup.find_all(text=["Tillie", "Elsie", "Lacie"]))
    print(soup.find_all(text=re.compile("Dormouse")))
    """
    ['Tillie']
    ['Lacie', 'Tillie']
    ["The Dormouse's story", "The Dormouse's story"]
    """

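Since Beautiful Soup 4.4.0 the same filter is also available under the name `string`, with `text` kept as the older spelling. The sketch below uses `string` on made-up markup and also shows that combining it with a tag name returns the tags rather than the strings.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = '<p><a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# On its own, string= returns the matching strings.
strings = soup.find_all(string=re.compile("ill"))

# Together with a tag name, it returns tags whose .string matches.
tags = soup.find_all("a", string="Lacie")

print(strings)          # ['Tillie']
print(tags[0]["id"])    # link2
```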
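4.2 CSS selectors
In CSS, a tag name is written without any prefix, a class name is prefixed with `.`, and an id is prefixed with `#`.
Use soup.select(). The return value is a list (in current releases a ResultSet, which is a subclass of list).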
4.2.1 Finding by tag name
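
    print(soup.select("title"))
    # [<title>The Dormouse's story</title>]

    print(soup.select("a"))
    """
    [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    """

    print(soup.select('b'))
    # [<b>The Dormouse's story</b>]
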
4.2.2 Finding by class name
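
    print(soup.select(".sister"))
    """
    [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    """
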
4.2.3 Finding by id

    print(soup.select("#link1"))
    # [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

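4.2.4 Combined lookups
Combined lookups follow the same rules as combining tag names, class names and ids in a CSS file. Parts that refer to different nodes (an ancestor and its descendant) are separated by a space.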

    print(soup.select("p #link1"))
    # [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

To match direct child tags only, separate the parts with `>`.

    print(soup.select("head > title"))
    # [<title>The Dormouse's story</title>]

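The space and `>` separators give different results as soon as a match is nested more than one level deep. A minimal sketch on made-up markup (the ids are invented for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <a> directly under <body>, one nested in <p>.
html = '<body><a id="top">top</a><p><a id="link1">Elsie</a></p></body>'
soup = BeautifulSoup(html, "html.parser")

# Space: any <a> anywhere below <body>.
any_depth = [a["id"] for a in soup.select("body a")]

# ">": only <a> tags that are direct children of <body>.
direct_only = [a["id"] for a in soup.select("body > a")]

print(any_depth)    # ['top', 'link1']
print(direct_only)  # ['top']
```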
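4.2.5 Finding by attribute
A lookup can also include an attribute, written in square brackets. Because the attribute and the tag name refer to the same node, there must be no space between them; otherwise nothing will match.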
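
    print(soup.select('a[class="sister"]'))
    """
    [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    """

    print(soup.select('a[href="http://example.com/elsie"]'))
    # [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
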
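Attributes can likewise be combined with the lookups above: parts that refer to different nodes are separated by a space, and parts that refer to the same node are written without a space.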

    print(soup.select('p a[href="http://example.com/elsie"]'))
    # [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

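Attribute selectors are not restricted to exact matches. The standard CSS operators `^=` (starts with), `$=` (ends with) and `*=` (contains) can be used in the same position. The sketch below applies them to a made-up copy of the three links.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the three links in the sample document.
html = (
    '<a href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" id="link2">Lacie</a>'
    '<a href="http://example.com/tillie" id="link3">Tillie</a>'
)
soup = BeautifulSoup(html, "html.parser")

starts = [a["id"] for a in soup.select('a[href^="http://example.com/"]')]
ends = [a["id"] for a in soup.select('a[href$="tillie"]')]
contains = [a["id"] for a in soup.select('a[href*=".com/el"]')]

print(starts)    # ['link1', 'link2', 'link3']
print(ends)      # ['link3']
print(contains)  # ['link1']
```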
4.2.6 Getting the content
select() always returns its results as a list. Iterate over the list and call get_text() on each element to obtain its text.

    soup = BeautifulSoup(html, 'lxml')

    print(type(soup.select("title")))
    # a list type (bs4.element.ResultSet in current releases)

    print(soup.select('title')[0])
    # <title>The Dormouse's story</title>

    print(soup.select("title")[0].get_text())
    # The Dormouse's story

    for title in soup.select("title"):
        print(title.get_text())
    # The Dormouse's story

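When only the first match is needed, `select_one()` avoids indexing into the list, and attribute values can be read from the result in the same way as from any Tag. A short sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = (
    '<p class="story">'
    '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> '
    '<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>'
    "</p>"
)
soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first match, or None when nothing matches.
first = soup.select_one("a.sister")
missing = soup.select_one("#nope")

# Text and attributes are read from the selected tags as usual.
first_text = first.get_text()
hrefs = [a["href"] for a in soup.select("p.story a")]

print(first_text)  # Lacie
print(missing)     # None
print(hrefs)       # ['http://example.com/lacie', 'http://example.com/tillie']
```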
Original article: https://www.cnblogs.com/nuochengze/p/12863045.html
Article link: http://soscw.com/index.php/essay/62341.html