Python Web Scraping: Advanced Exercises
2021-03-07 23:28
Original post: https://www.cnblogs.com/toooof/p/14254086.html

Fetch the names of the first 25 movies on the first page of Douban's Top 250, https://movie.douban.com/top250.
My answer:
import requests
from bs4 import BeautifulSoup
head={"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36"}
res=requests.get("https://movie.douban.com/top250",headers=head)
soup=BeautifulSoup(res.content,"html.parser")
for i in range(1,26):
    get = soup.select("#content > div > div.article > ol > li:nth-child(%s) > div > div.info > div.hd > a > span:nth-child(1)" % i)
    print(get[0].string)
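Since every entry shares the div.hd > a > span:nth-child(1) structure used in the selector above, the 25-iteration loop can be replaced by a single select call. A minimal sketch, assuming the list markup is the same as in the answer:

import requests
from bs4 import BeautifulSoup

head = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36"}
res = requests.get("https://movie.douban.com/top250", headers=head)
soup = BeautifulSoup(res.content, "html.parser")
# one CSS query (the original selector minus the li index) matches all 25 title spans
for span in soup.select("#content > div > div.article > ol > li > div > div.info > div.hd > a > span:nth-child(1)"):
    print(span.string)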
Scrape the names of all 250 movies from https://movie.douban.com/top250.
My answer:
import requests
from bs4 import BeautifulSoup
head={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"}
for j in range(0,10):
    res = requests.get("https://movie.douban.com/top250?start=" + "%s" % (25 * j), headers=head)
    soup = BeautifulSoup(res.content, "html.parser")
    for i in range(1, 26):
        get = soup.select("#content > div > div.article > ol > li:nth-child(%s) > div > div.info > div.hd > a > span:nth-child(1)" % i)
        print(get[0].string)
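The start offset can also be handed to requests as a params dict instead of being concatenated into the URL. A sketch of the same 10-page loop under that approach:

import requests
from bs4 import BeautifulSoup

head = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"}
for j in range(10):
    # requests builds ...?start=0, ?start=25, ... and URL-encodes it for us
    res = requests.get("https://movie.douban.com/top250", params={"start": 25 * j}, headers=head)
    soup = BeautifulSoup(res.content, "html.parser")
    for span in soup.select("#content > div > div.article > ol > li > div > div.info > div.hd > a > span:nth-child(1)"):
        print(span.string)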
Download every image on http://www.netbian.com/s/huyan/index.htm to local files, saving them as 第1张.jpg, 第2张.jpg, and so on ("image 1.jpg", "image 2.jpg", ...).
My answer:
import requests
from bs4 import BeautifulSoup
head={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36"}
res = requests.get("http://www.netbian.com/s/huyan/index.htm", headers=head)
soup = BeautifulSoup(res.text, "html.parser")
for i in range(4,20):
get=soup.select("#main > div.list > ul > li:nth-child(%s) > a > img "%i)
datel=str(get)
datelist=datel.split(‘‘‘\"‘‘‘)
print(datelist[3])
URL=datelist[3]
res1 = requests.get(URL, headers=head)
date=open("D:\\Project\\%s.jpg"%i,"wb+")
date.write(res1.content)
date.close()
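Instead of stringifying the tag and splitting on quote characters, BeautifulSoup lets you read the src attribute directly, which also makes it easy to name the files 第1张.jpg, 第2张.jpg as the exercise asks. A sketch built on the same #main > div.list selector; note the answer above deliberately skips the first three list items, while this version simply takes every img it finds:

import requests
from bs4 import BeautifulSoup

head = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36"}
res = requests.get("http://www.netbian.com/s/huyan/index.htm", headers=head)
soup = BeautifulSoup(res.text, "html.parser")

# img["src"] reads the attribute straight from the tag, no string splitting needed
for n, img in enumerate(soup.select("#main > div.list > ul > li > a > img"), start=1):
    pic = requests.get(img["src"], headers=head)
    with open("第%s张.jpg" % n, "wb") as f:   # saved in the current working directory
        f.write(pic.content)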
URL: https://www.kugou.com/yy/rank/home/1-8888.html
Scrape the song titles and artist names from the first page.
My answer:
import requests
from lxml import etree
head={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36"}
res=requests.get("https://www.kugou.com/yy/rank/home/1-8888.html",headers=head)
sele=etree.HTML(res.text)
for i in range(1,23):
    temp = sele.xpath('//*[@id="rankWrap"]/div[2]/ul/li[%s]/a/text()' % i)
    print(temp)
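Dropping the li[%s] index from the XPath makes it match every row at once, so the 22-iteration loop is not needed. A sketch using the same //*[@id="rankWrap"] path as above:

import requests
from lxml import etree

head = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36"}
res = requests.get("https://www.kugou.com/yy/rank/home/1-8888.html", headers=head)
sele = etree.HTML(res.text)
# without the positional index the XPath returns the text of every <a> in the ranking list
for song in sele.xpath('//*[@id="rankWrap"]/div[2]/ul/li/a/text()'):
    print(song.strip())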
Use a regular expression to match every occurrence of "zz" followed by digits in the string below, and print the matches.
string="asdasdzz234234adas,asdasdzz2348weqesad,zz657878asd"
My answer:
import re
line="asdasdzz234234adas,asdasdzz2348weqesad,zz657878asd"
pat = re.compile(r'zz\d+')
date=pat.findall(line)
print(date)
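If you also want the digits on their own, a capturing group inside the same pattern does it; a small sketch:

import re

line = "asdasdzz234234adas,asdasdzz2348weqesad,zz657878asd"
# group(0) is the whole match ("zz..."), group(1) is just the digits
for m in re.finditer(r'zz(\d+)', line):
    print(m.group(0), m.group(1))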
Using a user-agent pool, scrape the song titles and artist names of all 500 songs in the TOP500 chart at https://www.kugou.com/yy/rank/home/1-8888.html.
My answer:
import requests
from lxml import etree
import random
uapools = [
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)"
]
def ua():
    thisua = random.choice(uapools)
    head = {"User-Agent": thisua}
    return head

head = ua()
for j in range(1, 23):
    res = requests.get("https://www.kugou.com/yy/rank/home/%s-8888.html" % j, headers=head)
    sele = etree.HTML(res.text)
    for i in range(1, 23):
        temp = sele.xpath('//*[@id="rankWrap"]/div[2]/ul/li[%s]/a/text()' % i)
        print(temp)
for n in range(1, 17):
    res1 = requests.get("https://www.kugou.com/yy/rank/home/23-8888.html", headers=head)
    sele = etree.HTML(res1.text)
    temp1 = sele.xpath('//*[@id="rankWrap"]/div[2]/ul/li[%s]/a/text()' % n)
    print(temp1)
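One thing to note: ua() above is called only once, so every request goes out with the same header; the usual point of a user-agent pool is to pick a fresh one per request. A sketch of that variant, reusing the uapools list and XPath from above (pages 1-22 hold 22 songs each and page 23 holds the last 16):

import random
import requests
from lxml import etree

def ua():
    # choose a new User-Agent from the pool for every single request
    return {"User-Agent": random.choice(uapools)}   # uapools: the list defined above

songs = []
for page in range(1, 24):                           # 22 full pages plus page 23
    res = requests.get("https://www.kugou.com/yy/rank/home/%s-8888.html" % page, headers=ua())
    sele = etree.HTML(res.text)
    songs += sele.xpath('//*[@id="rankWrap"]/div[2]/ul/li/a/text()')

for song in songs[:500]:                            # keep exactly the TOP500 entries
    print(song.strip())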
Build a chat in the format below, replacing {br} in the replies with a newline (sample session; the bot answers in Chinese):
我:小猫
小K:崔燚
我:hello
小K:{face:14}Hi~
我:讲个笑话
小K:★ 关于国际理论{br}一个企业人士登机后发现他很幸运的坐在一个美女旁边。彼此交换简短的寒喧之后,他注意到她正在看一份性学统计的手册,于是他问她那本书,她答道:{br}
我:
My answer:
import requests
import urllib.request
import json
shuru=input("我:")
while shuru != "0":  # type 0 to end the chat
    key = urllib.request.quote(shuru)
    res = requests.get("http://api.qingyunke.com/api.php?key=free&appid=0&msg=" + key)
    ddd = json.loads(res.text)
    s = ddd["content"]
    s_replace = s.replace('{br}', "\n")
    print("小K:", s_replace)
    shuru = input("我:")
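The same chat loop can lean on requests for both the URL encoding and the JSON parsing, which removes the urllib and json imports. A minimal sketch against the same qingyunke endpoint (typing 0 ends the chat):

import requests

shuru = input("我:")
while shuru != "0":
    # params= URL-encodes the message, res.json() parses the reply
    res = requests.get("http://api.qingyunke.com/api.php",
                       params={"key": "free", "appid": 0, "msg": shuru})
    print("小K:", res.json()["content"].replace("{br}", "\n"))
    shuru = input("我:")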