Python爬虫PyQuery库基本用法入门教程

2018-10-15 17:15

阅读:331

本文实例讲述了Python爬虫PyQuery库基本用法。分享给大家供大家参考,具体如下:

PyQuery库也是一个非常强大又灵活的网页解析库,如果你有前端开发经验的,都应该接触过jQuery,那么PyQuery就是你非常绝佳的选择,PyQuery 是 Python 仿照 jQuery 的严格实现。语法与 jQuery 几乎完全相同,所以不用再去费心去记一些奇怪的方法了。

官网地址:

jQuery参考文档:

1、字符串的初始化

from pyquery import PyQuery as pq html = <div> <ul> <li class=item-0>first item</li> <li class=item-1><a href=link2.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >second item</a></li> <li class=item-0 active><a href=link3.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow ><span class=bold>third item</span></a></li> <li class=item-1 active><a href=link4.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >fourth item</a></li> <li class=item-0><a href=link5.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >fifth item</a></li> </ul></div> doc = pq(html) print(doc) print(type(doc)) print(doc(li))

<div>
<ul>
<li class=item-0>first item</li>
<li class=item-1><a href=link2.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >second item</a></li>
<li class=item-0 active><a href=link3.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow ><span class=bold>third item</span></a></li>
<li class=item-1 active><a href=link4.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >fourth item</a></li>
<li class=item-0><a href=link5.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >fifth item</a></li>
</ul></div>
<class pyquery.pyquery.PyQuery>
<li class=item-0>first item</li>
<li class=item-1><a href=link2.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >second item</a></li>
<li class=item-0 active><a href=link3.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow ><span class=bold>third item</span></a></li>
<li class=item-1 active><a href=link4.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >fourth item</a></li>
<li class=item-0><a href=link5.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >fifth item</a></li>

2、打开html文件

注意路劲问题

from pyquery import PyQuery as pq doc = pq(filename=index.html) print(doc) print(doc(head))

<title>Title</title>
</head>
<body>
<div>
<ul>
<li class=item-0>first item</li>
<li class=item-1><a href=link2.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >second item</a></li>
<li class=item-0 active><a href=link3.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow ><span class=bold>third item</span></a></li>
<li class=item-1 active><a href=link4.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >fourth item</a></li>
<li class=item-0><a href=link5.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >fifth item</a></li>
</ul></div>
</body>
</html>
<head>
<meta charset=UTF-8/>
<title>Title</title>
</head>

3、打开某个网站

doc = pq(

4、基于CSS选择器查找

from pyquery import PyQuery as pq html = <div> <ul id = haha> <li class=item-0>first item</li> <li class=item-1><a href=link2.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >second item</a></li> <li class=item-0 active><a href=link3.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow ><span class=bold>third item</span></a></li> <li class=item-1 active><a href=link4.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >fourth item</a></li> <li class=item-0><a href=link5.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >fifth item</a></li> </ul></div> doc = pq(html) print(doc) #id等于haha下面的class等于item-0下的a标签下的span标签(注意层级关系以空格隔开) print(doc(#haha .item-0 a span))

<div>
<ul id=haha>
<li class=item-0>first item</li>
<li class=item-1><a href=link2.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >second item</a></li>
<li class=item-0 active><a href=link3.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow ><span class=bold>third item</span></a></li>
<li class=item-1 active><a href=link4.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >fourth item</a></li>
<li class=item-0><a href=link5.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >fifth item</a></li>
</ul></div>
<span class=bold>third item</span>

5、可以通过已经查找的标签,查找这个标签下的子标签或者父标签,而不用从头开始查找。

from pyquery import PyQuery as pq html = <div class=‘content> <ul id = haha> <li class=item-0>first item</li> <li class=item-1><a href=link2.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >second item</a></li> <li class=item-0 active><a href=link3.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow ><span class=bold>third item</span></a></li> <li class=item-1 active><a href=link4.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >fourth item</a></li> <li class=item-0><a href=link5.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >fifth item</a></li> </ul></div> doc = pq(html) item = doc(div ul) print(item) #我们可以通过已经查找到的标签,再此查找这个标签下面的标签 print(item.parent()) print(item.children())

<ul id=haha>
<li class=item-0>first item</li>
<li class=item-1><a href=link2.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >second item</a></li>
<li class=item-0 active><a href=link3.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow ><span class=bold>third item</span></a></li>
<li class=item-1 active><a href=link4.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >fourth item</a></li>
<li class=item-0><a href=link5.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >fifth item</a></li>
</ul>
<div class=‘content’>
<ul id=haha>
<li class=item-0>first item</li>
<li class=item-1><a href=link2.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >second item</a></li>
<li class=item-0 active><a href=link3.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow ><span class=bold>third item</span></a></li>
<li class=item-1 active><a href=link4.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >fourth item</a></li>
<li class=item-0><a href=link5.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >fifth item</a></li>
</ul></div>
<li class=item-0>first item</li>
<li class=item-1><a href=link2.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >second item</a></li>
<li class=item-0 active><a href=link3.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow ><span class=bold>third item</span></a></li>
<li class=item-1 active><a href=link4.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >fourth item</a></li>
<li class=item-0><a href=link5.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >fifth item</a></li>

from pyquery import PyQuery as pq html = <div class=‘content> <ul id = haha> <li class=item-0>first item</li> <li class=item-1><a href=link2.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >second item</a></li> <li class=item-0 active><a href=link3.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow ><span class=bold>third item</span></a></li> <li class=item-1 active><a href=link4.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >fourth item</a></li> <li class=item-0><a href=link5.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >fifth item</a></li> </ul></div> doc = pq(html) item = doc(div ul) print(item) #注意这里查找ul标签的所有子标签,也就是li标签,下面是查找class属性的标签,如果你把class换成href肯定不行,它指的只是儿子并不是子子孙孙 print(item.children([class]))

6、获取属性值

from pyquery import PyQuery as pq html = <div class=‘content> <ul id = haha> <li class=item-0>first item</li> <li class=item-1><a href=link2.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >second item</a></li> <li class=item-0 active><a href=link3.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow ><span class=bold>third item</span></a></li> <li class=item-1 active><a href=link4.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >fourth item</a></li> <li class=item-0><a href=link5.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >fifth item</a></li> </ul></div> doc = pq(html) #注意class=item-0 active是一个class的属性,但是在pyquery里面要是中间也是空格隔开的话, #就变成了item-0下的active标签下的a标签了,所以这里空格必须改成点 item = doc(.item-0.active a) print(type(item)) print(item) #获取属性值的两种方法 print(item.attr.href) print(item.attr(href))

<class pyquery.pyquery.PyQuery>
<a href=link3.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow ><span class=bold>third item</span></a>
link3.html
link3.html

7、获取标签的内容

from pyquery import PyQuery as pq html = <div class=‘content> <ul id = haha> <li class=item-0>first item</li> <li class=item-1><a href=link2.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >second item</a></li> <li class=item-0 active><a href=link3.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow ><span class=bold>third item</span></a></li> <li class=item-1 active><a href=link4.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >fourth item</a></li> <li class=item-0><a href=link5.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >fifth item</a></li> </ul></div> doc = pq(html) a = doc(a).text() print(a)

#结果很有趣,他是找到所有标签的值,然后给连到一起打出来,就像一段话
second item third item fourth item fifth item

8、Dom操作

①、属性的增加删除操作

from pyquery import PyQuery as pq html = <div class=‘content> <ul id = haha> <li class=item-0>first item</li> <li class=item-1><a href=link2.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >second item</a></li> <li class=item-0 active><a href=link3.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow ><span class=bold>third item</span></a></li> <li class=item-1 active><a href=link4.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >fourth item</a></li> <li class=item-0><a href=link5.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >fifth item</a></li> </ul></div> doc = pq(html) li = doc(.item-0.active) print(li) #删除classactive print(li.removeClass(active)) #增加class属性haha print(li.addClass(haha))

<li class=item-0 active><a href=link3.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow ><span class=bold>third item</span></a></li>
<li class=item-0><a href=link3.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow ><span class=bold>third item</span></a></li>
<li class=item-0 haha><a href=link3.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow ><span class=bold>third item</span></a></li>

②、attrs和css

注意:下列操作有则改之,无则加之。

from pyquery import PyQuery as pq html = <div class=‘content> <ul id = haha> <li class=item-0>first item</li> <li class=item-1><a href=link2.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >second item</a></li> <li class=item-0 active><a href=link3.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow ><span class=bold>third item</span></a></li> <li class=item-1 active><a href=link4.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >fourth item</a></li> <li class=item-0><a href=link5.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow >fifth item</a></li> </ul></div> doc = pq(html) li = doc(.item-0.active) print(li) print(li.attr(id,id_test)) print(li.css(font-size,20px))

<li class=item-0 active><a href=link3.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow ><span class=bold>third item</span></a></li>
<li class=item-0 active id=id_test><a href=link3.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow ><span class=bold>third item</span></a></li>
<li class=item-0 active id=id_test style=font-size: 20px><a href=link3.html rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow rel=external nofollow ><span class=bold>third item</span></a></li>

③、删除某个标签,在爬去过程中我们通常爬去一下标签或者内容下来的时候总会有些不想要的标签,这个时候我们可以用下面的类似方法删除这个标签。

first item second item third item fourth item fifth item
first item

更多关于Python相关内容可查看本站专题:《Python Socket编程技巧总结》、《Python正则表达式用法总结》、《Python数据结构与算法教程》、《Python函数使用技巧总结》、《Python字符串操作技巧汇总》、《Python入门与进阶经典教程》及《Python文件与目录操作技巧汇总》

希望本文所述对大家Python程序设计有所帮助。


评论


亲,登录后才可以留言!