Steps and Tools for a Python Web Crawler

2021-07-09 02:04


# Four steps

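1. Inspect the source format of the content to crawl. The content can be URLs (links), text, images, or video.

2. Request the page source. This may require configuring a proxy, rate limiting, or cookies.

3. Match the content you want, using regular expressions.

4. Save the data, using file operations.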
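Steps 3 and 4 can be tried without touching the network. The following is a minimal sketch, assuming a made-up HTML snippet and a hypothetical helper name `save_matches`; it is not the author's code, only an illustration of matching with a regular expression and writing the results to a file.

```python
import re
from pathlib import Path

# Made-up page source standing in for the result of step 2.
SAMPLE_HTML = """
<ul>
  <li><a href="/post/1">First post</a></li>
  <li><a href="/post/2">Second
      post</a></li>
</ul>
"""

# Step 3: capture each link target and its text; re.S lets "." span newlines.
LINK_PATTERN = re.compile(r'<a href="(.*?)">(.*?)</a>', re.S)


def save_matches(html, path):
    """Match links in html and write one 'href<TAB>text' line per link to path."""
    matches = LINK_PATTERN.findall(html)
    lines = []
    for href, text in matches:
        clean = " ".join(text.split())  # collapse whitespace, including newlines
        lines.append(f"{href}\t{clean}")
    # Step 4: an ordinary file operation, text mode with an explicit encoding.
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return matches
```

Calling `save_matches(SAMPLE_HTML, "links.txt")` writes two lines to `links.txt`. For binary content such as images, the file would be opened in `'wb'` mode instead.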

# Two basic tools (libraries)

1. urllib

2. requests
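Step 2 above mentions proxies, rate limiting, and cookies as things that may need configuring. Here is a minimal sketch of how these might be set with urllib from the standard library. The helper names (`make_opener`, `fetch_all`), the placeholder proxy address, and the one-second default delay are assumptions and do not appear in the original.

```python
import time
import urllib.request
from http.cookiejar import CookieJar


def make_opener(proxy=None):
    """Return (opener, cookie_jar); the opener keeps cookies between requests."""
    jar = CookieJar()
    handlers = [urllib.request.HTTPCookieProcessor(jar)]
    if proxy:
        # Hypothetical proxy address, e.g. "http://127.0.0.1:8080".
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    return urllib.request.build_opener(*handlers), jar


def fetch_all(opener, urls, delay=1.0, encoding="utf-8"):
    """Fetch each URL in turn, pausing `delay` seconds between requests."""
    pages = []
    for n, url in enumerate(urls):
        if n:
            time.sleep(delay)  # rate limit: wait before every request but the first
        with opener.open(url, timeout=10) as response:
            pages.append(response.read().decode(encoding))
    return pages
```

requests exposes the same settings through `requests.Session`, via its `proxies` and `cookies` attributes.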

# An example using the requests library: downloading cute images

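import re        # regular expressions, used for matching
import requests  # HTTP library, used for fetching pages

url = r'https://www.woyaogexing.com/tupian/keai'  # page to crawl
response = requests.get(url)    # get() fetches the page
response.encoding = 'utf-8'     # so that Chinese text in the source displays correctly
html = response.text            # page source as a string
# The opening tag that originally preceded src= in this pattern is missing from the source.
strs = r'src="(.*?)".*?>'       # regular expression
pattern = re.compile(strs, re.S)  # compile into an object so it can be reused
items = pattern.findall(html)     # match
for n, i in enumerate(items):
    with open('%d.jpg' % n, 'wb') as file:  # create the file in binary write mode 'wb'
        img_url = 'https:' + i              # assumes src values are protocol-relative (//host/path)
        file.write(requests.get(img_url).content)  # write the data; images are binary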
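The example builds each image address with `'https:' + i`, which only works when the page uses protocol-relative `src` values such as `//host/a.jpg`. As a hedged alternative, `urllib.parse.urljoin` from the standard library resolves protocol-relative, root-relative, and absolute values against the page URL. The image hosts and paths below are made up for illustration.

```python
from urllib.parse import urljoin

PAGE_URL = "https://www.woyaogexing.com/tupian/keai"

# Made-up src values covering the three common forms.
srcs = [
    "//img.example.com/a.jpg",        # protocol-relative
    "/static/b.jpg",                  # root-relative
    "https://cdn.example.com/c.jpg",  # already absolute
]

# Resolve every src against the page it was found on.
image_urls = [urljoin(PAGE_URL, src) for src in srcs]
for image_url in image_urls:
    print(image_url)
```

In the loop above, `urljoin(url, i)` could then stand in for `'https:' + i`.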


Original source: https://www.cnblogs.com/vvlj/p/9580423.html
