Three Data-Parsing Approaches for Python Web Scrapers

2020-12-13 01:47

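Data-parsing approaches

`- Regular expressions`

`- xpath`

`- bs4`

How data parsing works:

Locate the target tags.
Extract the text stored inside those tags, or the data stored in their attributes.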

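Regular expressions

# Regular expressions
    Single characters:
        .  : any character except a newline
        [] : [aoe] [a-w] matches any one character in the set
        \d : a digit, e.g. [0-9]
        \D : a non-digit
        \w : a digit, letter, underscore, or Chinese character
        \W : anything \w does not match
        \s : any whitespace character, including space, tab, form feed, and so on; equivalent to [ \f\n\r\t\v]
        \S : any non-whitespace character
    Quantifiers:
        *     : any number of times, >= 0
        +     : at least once, >= 1
        ?     : optional, 0 or 1 times
        {m}   : exactly m times, e.g. hello{3} matches "hellooo"
        {m,}  : at least m times, e.g. hello{3,}
        {m,n} : between m and n times
    Anchors:
        $ : matches at the end
        ^ : matches at the start
    Grouping:
        (ab)
    Greedy mode: .*
    Non-greedy (lazy) mode: .*?

    re.I : ignore case
    re.M : multi-line mode (^ and $ also match at the start and end of every line)
    re.S : single-line mode (. also matches newlines, so the whole text is treated as one line)

    re.sub(pattern, replacement, string)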
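
The greedy/lazy distinction and the three flags are easiest to see side by side. The following is a minimal sketch using only the standard `re` module; the sample strings are made up for illustration.

```python
import re

row = "<p>a</p><p>b</p>"
# Greedy: .* extends to the LAST </p> it can reach.
greedy = re.findall(r"<p>(.*)</p>", row)
# Lazy: .*? stops at the FIRST </p>, giving one match per tag.
lazy = re.findall(r"<p>(.*?)</p>", row)

block = "<div>line1\nline2</div>"
# Without re.S the dot cannot cross the newline, so nothing matches.
without_s = re.findall(r"<div>(.*?)</div>", block)
# With re.S the dot also matches "\n".
with_s = re.findall(r"<div>(.*?)</div>", block, re.S)

lines = "alpha\nbeta"
# Without re.M, ^ anchors only at the very start of the string.
first_only = re.findall(r"^\w+", lines)
# With re.M, ^ anchors at the start of every line.
each_line = re.findall(r"^\w+", lines, re.M)

# re.I ignores case.
any_case = re.findall(r"python", "Python PYTHON", re.I)

# re.sub(pattern, replacement, string): collapse runs of whitespace.
collapsed = re.sub(r"\s+", " ", "a  \t b\n c")

print(greedy, lazy, without_s, with_s, first_only, each_line, any_case, collapsed)
```

When scraping HTML, `re.S` combined with a lazy `.*?` is the usual pairing, because tags routinely span several lines.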

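# Scrape all images from the picture section of Qiushibaike
import os
import re
import requests
from urllib import request

# Create the output directory if it does not exist yet
if not os.path.exists('./qiutu'):
    os.mkdir('./qiutu')
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}

url = 'https://www.qiushibaike.com/pic/'
page_text = requests.get(url=url, headers=headers).text

# NOTE: the original pattern was lost when this page was extracted (only '.*?' survived).
# The pattern below is an ASSUMPTION: it captures the src of the <img> inside each
# <div class="thumb">. Adjust it to the page's actual markup.
ex = r'<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
img_urls = re.findall(ex, page_text, re.S)
for src in img_urls:
    # The scheme is prepended because the captured src is expected to start with //
    img_url = 'https:' + src
    img_name = img_url.split('/')[-1]
    img_path = './qiutu/' + img_name
    request.urlretrieve(img_url, img_path)
    print(img_name, 'downloaded successfully!')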
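
To check a capture-group pattern without touching the network, it helps to run it against a small fragment first. The sketch below is illustrative only: the HTML fragment, the `pic.example.com` host, and the pattern itself are assumptions rather than the site's real markup, and `urljoin` is used here as an alternative to prepending `'https:'` by hand.

```python
import re
from urllib.parse import urljoin

# Made-up fragment; the real page markup may differ.
sample_html = '''
<div class="thumb">
<a href="/article/1"><img src="//pic.example.com/pictures/1/medium/app1.jpg" alt="one"></a>
</div>
<div class="thumb">
<a href="/article/2"><img src="//pic.example.com/pictures/2/medium/app2.jpg" alt="two"></a>
</div>
'''

# Assumed pattern: the group (.*?) captures the src of each thumbnail image.
pattern = r'<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
srcs = re.findall(pattern, sample_html, re.S)

# urljoin resolves protocol-relative URLs (//host/path) against the page URL.
page_url = "https://www.qiushibaike.com/pic/"
full_urls = [urljoin(page_url, src) for src in srcs]

# The file name is the last path segment, as in the scraper above.
names = [u.split("/")[-1] for u in full_urls]
print(full_urls, names)
```

Because `findall` returns only the group's contents when the pattern has exactly one group, the result is a flat list of URLs that can be looped over directly.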


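bs4 parsing


How it works:

Instantiate a BeautifulSoup object and load the page source into it.
Use that object's attributes and methods to locate tags and extract data.



Installing the environment:

pip install bs4
pip install lxml



Instantiating a BeautifulSoup object

BeautifulSoup(page_text,'lxml'): loads page source requested from the internet into the object
BeautifulSoup(fp,'lxml'): loads page source stored in a local file into the object (fp is an open file object)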
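
Both ways of instantiating the object can be tried offline. This is a minimal sketch with a made-up page; it passes Python's built-in `'html.parser'` so it runs without lxml installed, and `'lxml'` can be substituted as in the text.

```python
import os
import tempfile
from bs4 import BeautifulSoup

# Made-up page source standing in for a downloaded response.
page_text = "<html><head><title>Demo</title></head><body><p>hello</p></body></html>"

# 1) From a string, e.g. requests.get(...).text
soup_from_text = BeautifulSoup(page_text, "html.parser")

# 2) From an open file object pointing at a saved page
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "page.html")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(page_text)
    with open(path, "r", encoding="utf-8") as fp:
        soup_from_file = BeautifulSoup(fp, "html.parser")

print(soup_from_text.title.string)
print(soup_from_file.p.string)
```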




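Attributes


soup.a.attrs returns a dictionary of all the tag's attributes and their values
soup.a['href'] gets the href attribute


Text


soup.a.string      # the tag's single string; None if the tag has several children
soup.a.text        # all text inside the tag, including text in nested tags
soup.a.get_text()  # same result as .text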
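
The difference between `.string` and `.text` only shows up when a tag has nested children. A small sketch on a made-up fragment (parsed with the built-in `'html.parser'` so lxml is not required):

```python
from bs4 import BeautifulSoup

# Made-up fragment: the <a> tag contains both plain text and a nested <b>.
html = '<div><a href="/page/1" title="first" id="link1">Go <b>here</b></a></div>'
soup = BeautifulSoup(html, "html.parser")

a = soup.a                   # first <a> tag in the document
attrs = a.attrs              # dict of every attribute on the tag
href = a["href"]             # one attribute by name

direct = a.string            # None: <a> has two children ("Go " and <b>)
all_text = a.text            # text of the tag and everything nested in it
same_text = a.get_text()     # method form of .text
inner = a.b.string           # <b> has a single string child

print(attrs, href, direct, all_text, inner)
```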



The find method



# find locates only the first tag that meets the criteria; it returns a single object
soup.find('a')
soup.find('a', class_='xxx')
soup.find('a', title='xxx')
soup.find('a', id='xxx')
soup.find('a', id=re.compile(r'xxx'))
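
The filters above can be exercised on a small made-up fragment. The sketch below also shows that `find` returns `None` when nothing matches, which is worth checking before accessing attributes on the result.

```python
import re
from bs4 import BeautifulSoup

# Made-up links for illustration.
html = '''
<a href="/a" class="nav" id="link1" title="home">Home</a>
<a href="/b" class="nav" id="link2" title="news">News</a>
<a href="/c" class="footer" id="other" title="about">About</a>
'''
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")                                 # first <a> in the document
by_class = soup.find("a", class_="footer")             # class_ avoids the keyword "class"
by_title = soup.find("a", title="news")                # filter on any attribute
by_id_regex = soup.find("a", id=re.compile(r"^link"))  # regex against the attribute value
missing = soup.find("a", id="nope")                    # no match

print(first.text, by_class.text, by_title.text, by_id_regex["id"], missing)
```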



find_all



# Returns a list containing every object that meets the criteria
soup.find_all('a')
soup.find_all('a', class_='wang')
soup.find_all('a', id=re.compile(r'xxx'))
soup.find_all('a', limit=2)   # take only the first two matching a tags
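
A short sketch of the same calls on made-up data, showing that the result is always a list (empty when nothing matches) and that `limit` caps its length:

```python
import re
from bs4 import BeautifulSoup

# Made-up links for illustration.
html = '''
<a href="/a" class="nav" id="link1">Home</a>
<a href="/b" class="nav" id="link2">News</a>
<a href="/c" class="footer" id="other">About</a>
'''
soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all("a")
nav_links = soup.find_all("a", class_="nav")
numbered = soup.find_all("a", id=re.compile(r"link\d"))
first_two = soup.find_all("a", limit=2)
none_found = soup.find_all("table")

# Each item is a tag object, so attributes and text are read per item.
hrefs = [a["href"] for a in all_links]
print(hrefs, [a.text for a in first_two], none_found)
```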



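select



# select takes a CSS selector
Commonly used selectors
Tag selector, id selector, class selector
Hierarchy selectors
div h1 a      each later element only needs to be a descendant of the earlier one, at any depth
div > h1 > a  each later element must be a direct child of the earlier one
Attribute selector
input[name='hehe']
select('selector')
Returns a list whose items are all tag objects
find, find_all, and select work not only on the soup object but also on any child object; calling select on a child object searches only inside that child for tags matching the selector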
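
The descendant and direct-child selectors are easy to confuse, so here is a sketch on a made-up fragment in which one link sits directly under an `h1` and another is nested more deeply. The element names and ids are illustrative only.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration.
html = '''
<div id="main">
  <h1><a href="/direct">direct</a></h1>
  <section><h1><span><a href="/nested">nested</a></span></h1></section>
  <input name="hehe" value="42">
</div>
'''
soup = BeautifulSoup(html, "html.parser")

# Descendant selector: any depth is fine, so both links match.
descendant = [a["href"] for a in soup.select("div h1 a")]
# Child selector: every step must be a direct parent-child link.
direct_child = [a["href"] for a in soup.select("div > h1 > a")]
# Attribute selector.
inputs = soup.select("input[name='hehe']")
# id selector.
main = soup.select("#main")

# select on a child object searches only inside that child.
section = soup.select("section")[0]
inside_section = [a["href"] for a in section.select("a")]

print(descendant, direct_child, inputs[0]["value"], inside_section)
```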


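# Scrape the novel Romance of the Three Kingdoms from shicimingju.com
import requests
from bs4 import BeautifulSoup

# Spoofed User-Agent, the same one used in the regex example
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}

url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
page_text = requests.get(url=url, headers=headers).text
# Parse the chapter titles and URLs
soup = BeautifulSoup(page_text, 'lxml')
li_list = soup.select('.book-mulu > ul > li')
# The with block closes the file even if a request fails midway
with open('./sanguo.txt', 'w', encoding='utf-8') as fp:
    for li in li_list:
        title = li.a.string
        detail_url = 'http://www.shicimingju.com' + li.a['href']
        # Request each chapter's detail page separately to get its source
        detail_page_text = requests.get(url=detail_url, headers=headers).text
        detail_soup = BeautifulSoup(detail_page_text, 'lxml')
        content = detail_soup.find('div', class_='chapter_content').text

        fp.write(title + '\n' + content + '\n')
        print(title, ': downloaded successfully!')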


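xpath parsing:
- Relatively high parsing efficiency
- The most widely applicable of the three approaches

- Installation: pip install lxml
- How it works:
    - Instantiate an etree object and load the page source to be parsed into it
    - Use the etree object's xpath method together with xpath expressions to locate tags and extract data
- Instantiating an etree object
    - etree.parse('local file path')
    - etree.HTML(page_text)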
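
Both ways of building the tree can be tried on a made-up page. One detail worth knowing: `etree.parse` uses a strict XML parser by default, so the sketch below passes `etree.HTMLParser()` explicitly when reading an HTML file; the file name and markup are illustrative assumptions.

```python
import os
import tempfile
from lxml import etree

# Made-up page source for illustration.
page_text = (
    "<html><body><div class='box'>"
    "<p>hello</p><a href='/x'>link</a>"
    "</div></body></html>"
)

# From a string, e.g. requests.get(...).text
tree = etree.HTML(page_text)
texts = tree.xpath('//div[@class="box"]/p/text()')   # text() selects text nodes
hrefs = tree.xpath('//div[@class="box"]/a/@href')    # @href selects attribute values

# From a local file; HTMLParser tolerates markup that is not well-formed XML.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "page.html")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(page_text)
    file_tree = etree.parse(path, etree.HTMLParser())
    file_texts = file_tree.xpath("//p/text()")

print(texts, hrefs, file_texts)
```

Note that `xpath` always returns a list, even when the expression matches a single node.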


# Scrape the names of cities across China
import requests
from lxml import etree
# Spoof the User-Agent
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
url = 'https://www.aqistudy.cn/historydata/'
page_text = requests.get(url=url, headers=headers).text

tree = etree.HTML(page_text)
# hot_city = tree.xpath('//div[@class="bottom"]/ul/li/a/text()')
# all_city = tree.xpath('//div[@class="bottom"]/ul/div[2]/li/a/text()')
# all_city

# The | operator takes the union of both expressions, so one call returns the hot cities and all cities
city_names = tree.xpath('//div[@class="bottom"]/ul/div[2]/li/a/text() | //div[@class="bottom"]/ul/li/a/text()')
print(city_names)
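
The `|` union can be checked offline against a fragment that imitates the two list layouts the expressions target. The markup and city names below are a made-up approximation, not a copy of the real page.

```python
from lxml import etree

# Made-up markup: the first block lists <li> directly under <ul>,
# the second wraps its <li> items in the second <div> under <ul>.
sample = '''
<div class="bottom">
  <ul>
    <li><a href="#">Beijing</a></li>
    <li><a href="#">Shanghai</a></li>
  </ul>
</div>
<div class="bottom">
  <ul>
    <div><b>A</b></div>
    <div><li><a href="#">Anshan</a></li><li><a href="#">Anyang</a></li></div>
  </ul>
</div>
'''
tree = etree.HTML(sample)

hot_expr = '//div[@class="bottom"]/ul/li/a/text()'
all_expr = '//div[@class="bottom"]/ul/div[2]/li/a/text()'

hot_city = tree.xpath(hot_expr)
all_city = tree.xpath(all_expr)
# One call with | returns every node matched by either expression.
combined = tree.xpath(all_expr + ' | ' + hot_expr)

print(hot_city, all_city, combined)
```

Using the union saves a second `xpath` call when the same data lives under two different paths on one page.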


 
 
 
Original article: https://www.cnblogs.com/q455674496/p/11000348.html
