BeautifulSoup4 - Extracting All URL Links from HTML
2021-01-30 03:13
Tags: ever this demo url mpi can ext ack html
'''
Extract all URL links from HTML
'''
import requests
from bs4 import BeautifulSoup
import re
# r = requests.get("https://python123.io/ws/demo.html")
# demo = r.text
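demo = """
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>
"""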
"""
The find_all(name, attrs, recursive, string, **kwargs) method:
soup(..) is equivalent to soup.find_all(..)
"""
soup = BeautifulSoup(demo, "html.parser")
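for link in soup.find_all('a'):  # 1. find every <a> tag
    print(link.get("href"))  # 2. read each tag's href attribute, i.e. the link URL
print(soup.find_all('a'))  # find all <a> tags
# -> a list holding the two <a> elements, "Basic Python" and "Advanced Python"
print(soup.find_all(['a', 'b']))  # find <a> and <b> tags in one call
for tag in soup.find_all(True):  # True matches every tag
    print(tag.name)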
'''
html
head
title
body
p
b
p
a
a
'''
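# Show only tags whose name starts with b, i.e. the <b> and <body> elements
for tag in soup.find_all(re.compile('^b')):  # the regex is applied with search(), so ^ anchors it to the start of the tag name
    print(tag.name)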
print(soup.find_all('p', 'course'))  # <p> tags whose class attribute is "course"
print(soup.find_all(id="link1"))  # tags whose id attribute equals "link1"
print(soup.find_all(id=re.compile("link")))  # tags whose id attribute contains "link"
print(soup.find_all(attrs={"class": "py1"}))  # tags whose class attribute is "py1"
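
The link-extraction loop at the top of the script can be wrapped into a small reusable helper. The sketch below is only one possible way to do it: the names `extract_links`, `base_url`, and `sample` are made up for illustration and are not part of the original script. It assumes you want each link once, in document order, with relative links resolved against a base URL.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url=""):
    """Return the href of every <a> tag, in document order, without duplicates."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True keeps only <a> tags that actually carry an href attribute
    for a in soup.find_all("a", href=True):
        url = urljoin(base_url, a["href"])  # resolve relative links against base_url
        if url not in links:
            links.append(url)
    return links


# Hypothetical sample markup: one relative link (repeated), one absolute link,
# and one <a> tag without an href.
sample = (
    '<p><a href="/a">A</a> <a href="https://example.org/b">B</a> '
    '<a name="x">no href</a> <a href="/a">A again</a></p>'
)
print(extract_links(sample, "https://example.com/"))
```

Passing `href=True` avoids the `None` values that `link.get("href")` returns for anchors without an `href`. Leaving `base_url` empty returns the links exactly as they are written in the page.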
Original article: https://www.cnblogs.com/pencil2001/p/13197203.html
Article link: http://soscw.com/essay/48963.html