Parsing HTML with BeautifulSoup

Hollis · Filed under Python_web_crawler and Python网络爬虫 (Python Web Crawlers)

2025-11-03 2025-11-04 · About 6,500 characters · Estimated reading time: 13 minutes

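1. Why Parse HTML

When collecting data from the web, working directly with raw HTML text means sifting through a jumble of markup. Whether you are extracting product prices from an e-commerce page or article text from a news site, the raw HTML mixes tags, attributes, and nested structures, so analyzing it by hand is slow and error-prone.

The core value of parsing HTML is that it:

Converts unstructured HTML text into structured data
Provides a convenient way to locate and extract the target information
Copes with the malformed HTML that real-world pages often contain
Simplifies the extraction workflow and makes code more readable and maintainable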

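2. Introducing BeautifulSoup

2.1 What Is BeautifulSoup

BeautifulSoup is a Python library for parsing HTML and XML. It can:

Automatically repair incomplete or malformed HTML (such as missing closing tags)
Convert an HTML document into a traversable tree structure (similar to the DOM tree)
Provide an intuitive API for searching, locating, and extracting node information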
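
As a small illustration of the repair behavior listed above, the sketch below parses a fragment with two unclosed tags. The fragment is made up for this example, and the built-in html.parser is used so that nothing extra needs to be installed; other parsers may repair the same input somewhat differently.

```python
from bs4 import BeautifulSoup

# Made-up fragment: neither <p> nor <b> is ever closed.
broken_html = "<p>Hello <b>world"

soup = BeautifulSoup(broken_html, "html.parser")

# The parsed tree has properly nested tags, so normal navigation works.
bold_text = soup.b.string

# Serializing the tree writes out the closing tags that were missing.
repaired = str(soup)
print(repaired)
```

Because the tree is already well formed after parsing, the rest of the code never has to deal with the missing closing tags.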

It is usually paired with the requests library to form a complete data collection workflow:

Send request (requests) → Get HTML → Parse into a tree (BeautifulSoup) → Extract target data

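2.2 Why Choose BeautifulSoup

Compared with other extraction approaches such as regular expressions, its advantages are:

Intuitive: elements are located by tag name, attributes, and hierarchy, so the code is easy to read
Fault-tolerant: it handles the "imperfect" HTML commonly found in the wild
Full-featured: it supports multiple search strategies, covering simple to complex extraction needs
Easy to learn: the API is concise and quick to pick up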

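3. Installation and Environment Setup

3.1 Installing BeautifulSoup

Install the core library with pip:

pip install beautifulsoup4

To speed up the download, users in mainland China can use a domestic mirror:

pip install beautifulsoup4 -i https://pypi.tuna.tsinghua.edu.cn/simple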

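3.2 Installing a Parser

BeautifulSoup relies on a parser to do the actual HTML parsing. lxml is recommended because it is fast and tolerant of malformed markup:

pip install lxml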

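Comparison of common parsers:

Parser	Advantages	Installation
lxml	Fast; supports HTML and XML	`pip install lxml`
html.parser	Built into Python; nothing extra to install	None required
html5lib	Most lenient; parses pages the way a browser does	`pip install html5lib`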
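
Since lxml and html5lib are optional installs, a script can check what is available before choosing a parser. The following is a minimal sketch of that idea; `pick_parser` is a hypothetical helper written for this example, not part of BeautifulSoup.

```python
import importlib.util

from bs4 import BeautifulSoup


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the first installed third-party parser, else the built-in one."""
    for name in preferred:
        if importlib.util.find_spec(name) is not None:
            return name
    return "html.parser"


parser_name = pick_parser()
soup = BeautifulSoup("<p>parser check</p>", parser_name)
print(parser_name, soup.p.string)
```

Different parsers can build slightly different trees from the same malformed input, so it is usually safer to settle on one parser for a project than to switch between them at runtime.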

4. Basic Workflow

4.1 Creating a BeautifulSoup Object

import requests
from bs4 import BeautifulSoup

# 1. Fetch the HTML content (reusing what we learned about requests)
url = "https://xuexiqu.cn/python_web_crawler/1.文本格式标签.html"
response = requests.get(url)
response.encoding = 'utf-8'
html_content = response.text

# 2. Create the BeautifulSoup object
soup = BeautifulSoup(html_content, "html.parser")  # the second argument selects the parser

# 3. Pretty-print the tree (handy for debugging)
print(soup.prettify())  # prints the repaired markup with indentation

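4.2 Core Object Types

Parsing produces four main kinds of object:

Tag: an HTML tag object (such as <p> or <div>)
NavigableString: the text inside a tag
BeautifulSoup: the root object representing the whole document
Comment: the content of an HTML comment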

# Example: getting a tag object
title_tag = soup.title  # the <title> tag
print(title_tag)  # Output: <title>文本标签示例</title>
print(type(title_tag))  # Output: <class 'bs4.element.Tag'>

# Getting the text inside the tag
title_text = title_tag.string
print(title_text)  # Output: 文本标签示例
print(type(title_text))  # Output: <class 'bs4.element.NavigableString'>
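
The snippet above covers Tag and NavigableString. To round out the four types, here is a small sketch using a made-up fragment that also shows the BeautifulSoup root object and a Comment.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Made-up fragment containing a comment and a paragraph.
html = "<div><!-- stock updated daily --><p>In stock</p></div>"
soup = BeautifulSoup(html, "html.parser")

comment = soup.div.contents[0]  # first child of <div> is the comment
text = soup.p.string            # ordinary text inside <p>

print(type(soup).__name__)                   # the root object
print(isinstance(soup.p, Tag))               # a tag
print(isinstance(comment, Comment))          # a comment
print(isinstance(comment, NavigableString))  # Comment is a kind of NavigableString
print(isinstance(text, Comment))             # plain text is not a Comment
```

Because Comment is a subclass of NavigableString, code that collects strings from a page may want an explicit `isinstance(node, Comment)` check to leave comments out.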

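5. Searching the Document Tree

5.1 Basic Search Methods

The two most commonly used methods are:

find(): returns the first matching tag, or None if nothing matches
find_all(): returns a list of all matching tags (empty if nothing matches)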
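
The difference in return values matters in practice, so here is a minimal sketch on a made-up list. It also uses the `limit` argument of find_all(), which caps the number of results.

```python
from bs4 import BeautifulSoup

html = "<ul><li>A</li><li>B</li><li>C</li></ul>"
soup = BeautifulSoup(html, "html.parser")

first_item = soup.find("li")              # a single Tag
all_items = soup.find_all("li")           # list-like collection of Tags
two_items = soup.find_all("li", limit=2)  # stop after two matches
missing_one = soup.find("table")          # no match: None
missing_all = soup.find_all("table")      # no match: empty collection

print(first_item.text, [li.text for li in all_items], len(two_items))
print(missing_one, len(missing_all))
```

A None result from find() is the usual source of AttributeError in scraping scripts, which is why the examples in this chapter check results with `if` before using them.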

Searching by Tag Name

Example 1: Using "1.文本格式标签.html", extract the heading and paragraph tags from the page:

import requests
from bs4 import BeautifulSoup

# Fetch the HTML content over the network
url = "https://xuexiqu.cn/python_web_crawler/1.文本格式标签.html"
response = requests.get(url)
response.encoding = "utf-8"
html_content = response.text

# Create the BeautifulSoup object
soup = BeautifulSoup(html_content, "lxml")

# Find all heading tags (passing a list matches any of the names)
title_tags = soup.find_all(['h1', 'h2', 'h3'])
print("Heading tags on the page:")
for tag in title_tags:
    print(f"{tag.name}: {tag.text}")

# Find the first paragraph tag
first_p = soup.find('p')
if first_p:
    print("\nFirst paragraph:", first_p.text)

Output:

Heading tags on the page:
h1: 这是一级标题 - 公司简介
h2: 这是二级标题 - 发展历程
h3: 这是三级标题 - 2023年里程碑

First paragraph: 这是一个段落，包含了一些重要的公司信息。我们致力于为客户提供优质的服务。

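[Class Exercise] Searching by Tag Name, Exercise 1: Extract All Hyperlinks

Task description:

Count all hyperlinks (a tags) on a web page and display the complete HTML of each link. The page URL is "https://xuexiqu.cn/python_web_crawler/2.链接和多媒体标签.html".

Complete the following tasks:

Use the requests library to fetch the page, and set the encoding to "utf-8" to avoid garbled Chinese text.
Use BeautifulSoup with the "lxml" parser to parse the page's HTML.
Find all hyperlinks (a tags) on the page and count them.
First print the total number of links in the format "Total links: X" (where X is the actual count).
Then iterate over the links and print the complete HTML of each one.

Sample output format:

Total links: 5
<a href="https://example.com">Sample link 1</a>
<a href="https://test.com" class="link">Sample link 2</a>
<a href="/relative/path">Relative path link</a>
...

Requirements:

Clear code structure, with comments where needed
Correct handling of the page encoding
An accurate count of all a tags
The complete HTML of each link in the output, including all attributes and text
Sensible variable names

Note: The actual output depends on the number and content of the links on the page.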

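[Class Exercise] Searching by Tag Name, Exercise 2: Extract the Text of All List Items

Task description:

Extract the content of every list item (li) in the unordered lists (ul) and ordered lists (ol) of a web page. The page URL is "https://xuexiqu.cn/python_web_crawler/3.列表示例.html".

Complete the following tasks:

Use the requests library to fetch the page, and set the encoding to "utf-8".
Use BeautifulSoup with the "lxml" parser to parse the page.
Extract the text of the list items (li) in every unordered list (ul). Only direct child items should be extracted (that is, do not search nested li elements recursively), and each line should be prefixed with "- " as a marker.
Extract the text of the list items (li) in every ordered list (ol), numbering the items in order starting from 1, in the format "number. text".
Print the results in the sample format.

Note: The actual output depends on the content of the page.

Sample output format:

Unordered list items:
- Item 1
- Item 2
- Item 3

Ordered list items:
1. Item A
2. Item B
3. Item C

Requirements:

Clear code structure, with comments where needed
Correct handling of the encoding
Use of the recursive=False argument to get only direct child list items
Ordered-list numbering that starts from 1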
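
The recursive=False argument is easiest to understand on a tiny nested list. The sketch below uses made-up markup (not the exercise page) and shows how the argument changes what find_all() returns.

```python
from bs4 import BeautifulSoup

# Made-up nested list: "Fruit" contains a sub-list with one item.
html = '<ul id="menu"><li>Fruit<ul><li>Apple</li></ul></li><li>Vegetables</li></ul>'
soup = BeautifulSoup(html, "html.parser")

menu = soup.find("ul", id="menu")

# Default search: descends into nested lists as well.
all_items = menu.find_all("li")

# recursive=False: only the direct children of `menu` are considered.
direct_items = menu.find_all("li", recursive=False)

# First piece of text in each direct item (ignores the nested list's text).
labels = [next(li.stripped_strings) for li in direct_items]

print(len(all_items), len(direct_items), labels)
```

Note that recursive=False limits the search to the children of the tag it is called on, so it has to be called on the ul or ol element itself rather than on the whole soup.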

Searching by Attribute

Example 1: Using "5.id与class属性.html", find elements by their class and id attributes:

import requests
from bs4 import BeautifulSoup

# Fetch the HTML content over the network
url = "https://xuexiqu.cn/python_web_crawler/5.id与class属性.html"
response = requests.get(url)
response.encoding = "utf-8"
html_content = response.text

# Create the BeautifulSoup object
soup = BeautifulSoup(html_content, "lxml")

# Find every div whose class is "product-card"
# (class_ has a trailing underscore because class is a Python keyword)
product_cards = soup.find_all('div', class_='product-card')
print(f"Found {len(product_cards)} product cards")

for product_card in product_cards:
    product_name = product_card.find('h3', class_='product-name')
    print(f'Product name: {product_name.text}')

# Find the element whose id is "shopping-cart"
cart = soup.find(id='shopping-cart')
if cart:
    print("\nShopping cart:", cart.text.strip())

Output:

Found 4 product cards
Product name: iPhone 15 Pro
Product name: 三星 Galaxy S23
Product name: iPad Air
Product name: MacBook Pro

Shopping cart: 购物车: 3 件商品
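
Two details of attribute search are worth a quick sketch: class_ matches any one of an element's classes, and attributes whose names are not valid Python keyword arguments (such as data-* attributes) can be passed through the attrs dictionary. The markup and the data-sku attribute below are made up for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class="product-card featured" data-sku="A1"><h3 class="product-name">Phone</h3></div>
<div class="product-card" data-sku="B2"><h3 class="product-name">Tablet</h3></div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches if the value is one of the element's classes.
cards = soup.find_all("div", class_="product-card")
featured = soup.find_all("div", class_="featured")

# data-sku cannot be written as a keyword argument, so use attrs.
by_sku = soup.find("div", attrs={"data-sku": "B2"})

# The class attribute is multi-valued and comes back as a list.
card_classes = cards[0]["class"]

# .get() returns None for a missing attribute instead of raising KeyError.
missing_attr = by_sku.get("data-missing")

print(len(cards), len(featured), by_sku.h3.text, card_classes, missing_attr)
```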

Example 2: From "2.链接和多媒体标签.html", extract the href value of every hyperlink (<a> tag) and separate internal links (starting with #) from external links (starting with http).

import requests
from bs4 import BeautifulSoup

# Fetch the HTML content
url = "https://xuexiqu.cn/python_web_crawler/2.链接和多媒体标签.html"
response = requests.get(url)
response.encoding = "utf-8"
soup = BeautifulSoup(response.text, "lxml")

# Find all a tags that have an href attribute
all_links = soup.find_all('a', href=True)

# Classify the links
internal_links = []  # internal links (start with #)
external_links = []  # external links (start with http)
other_links = []     # everything else

for link in all_links:
    href = link.get('href').strip()
    if href.startswith('#'):
        internal_links.append(href)
    elif href.startswith('http'):
        external_links.append(href)
    else:
        other_links.append(href)

# Print the results
print(f"Found {len(all_links)} hyperlinks")

print("\nInternal links (start with #):")
for idx, link in enumerate(internal_links, 1):
    print(f"{idx}. {link}")

print("\nExternal links (start with http):")
for idx, link in enumerate(external_links, 1):
    print(f"{idx}. {link}")

if other_links:
    print("\nOther links:")
    for idx, link in enumerate(other_links, 1):
        print(f"{idx}. {link}")

Output:

Found 7 hyperlinks

Internal links (start with #):
1. #contact

External links (start with http):
1. https://www.baidu.com
2. https://www.jd.com
3. https://item.jd.com/100209268193.html
4. https://item.jd.com/100089816136.html
5. https://item.jd.com/100142621568.html

Other links:
1. mailto:contact@company.com

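Attribute Search Exercise 1: Extract Buttons with the disabled Attribute

From "5.id与class属性.html", extract every button (button tag) that has a disabled attribute and print the button text.

Attribute Search Exercise 2: Extract All Hyperlinks Whose target Is _blank

From "2.链接和多媒体标签.html", extract every hyperlink (a tag) whose target attribute is _blank and print the link text and href value.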

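5.2 Other Common Search Methods

Method	Description
`find_parent()`	Finds the nearest matching ancestor (the direct parent when no filter is given)
`find_parents()`	Finds all matching ancestors
`find_next_sibling()`	Finds the next matching sibling
`find_previous_sibling()`	Finds the previous matching sibling
`find_all_next()`	Finds all matching nodes after the current node
`find_all_previous()`	Finds all matching nodes before the current node

The following example uses "4.表格示例.html" and relies on find_parent() to get from a table cell to its row: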

import requests
from bs4 import BeautifulSoup

url = "https://xuexiqu.cn/python_web_crawler/4.表格示例.html"
response = requests.get(url)
response.encoding = "utf-8"
soup = BeautifulSoup(response.text, "lxml")

# Count products by stock status
in_stock = soup.find_all('td', class_='in-stock')
out_stock = soup.find_all('td', class_='out-stock')

print(f"In-stock products: {len(in_stock)}")
print(f"Out-of-stock products: {len(out_stock)}")

# Print the full details of each out-of-stock product
print("\nOut-of-stock product details:")
for stock_cell in out_stock:
    # Walk up to the enclosing row, then collect all of its cells
    row_cells = stock_cell.find_parent('tr').find_all('td')
    print(f"Product ID: {row_cells[0].text}")
    print(f"Product name: {row_cells[1].text}")
    print(f"Category: {row_cells[2].text}")
    print(f"Price: {row_cells[3].text}")
    print(f"Listed on: {row_cells[5].text}\n")

Output:

In-stock products: 4
Out-of-stock products: 1

Out-of-stock product details:
Product ID: P003
Product name: 无线蓝牙耳机
Category: 电子产品
Price: ¥399.00
Listed on: 2024-01-20
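
The example above exercises only find_parent(). For the sibling and ancestor methods in the table, here is a minimal sketch on a made-up two-row table; the product data is invented for illustration.

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><td>P001</td><td class="name">Mouse</td><td>19.90</td></tr>
  <tr><td>P002</td><td class="name">Keyboard</td><td>49.00</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

name_cell = soup.find("td", class_="name")  # the "Mouse" cell

# Siblings: neighbouring cells in the same row.
id_cell = name_cell.find_previous_sibling("td")
price_cell = name_cell.find_next_sibling("td")

# Ancestors, nearest first.
ancestors = [tag.name for tag in name_cell.find_parents(["tr", "table"])]

# Everything later or earlier in document order, not just siblings.
later_names = [td.text for td in name_cell.find_all_next("td", class_="name")]
earlier_cells = [td.text for td in price_cell.find_all_previous("td")]

print(id_cell.text, price_cell.text, ancestors, later_names, earlier_cells)
```

A tag name is passed to each sibling call here so that the whitespace text nodes between cells and rows are never candidates for a match.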

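6. Comprehensive Example (Parsing 2.链接和多媒体标签.html)

The following example works on a local copy of the "2.链接和多媒体标签.html" file and walks through a complete parsing process: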

from bs4 import BeautifulSoup

# Read the HTML file
with open("2.链接和多媒体标签.html", "r", encoding="utf-8") as f:
    html_content = f.read()

# Create the parser object
soup = BeautifulSoup(html_content, "lxml")

# 1. Extract basic page information
page_title = soup.title.text
main_title = soup.find("h1").text
print(f"Page title: {page_title}")
print(f"Main heading: {main_title}\n")

# 2. Analyze the navigation links
nav_div = soup.find("div", class_="nav-links")
nav_links = nav_div.find_all("a")
print("Navigation links:")
for i, link in enumerate(nav_links, 1):
    link_text = link.text
    link_href = link.get("href")
    link_target = link.get("target", "default window")  # fallback when target is absent
    print(f"Link {i}: {link_text} → {link_href} (target: {link_target})")

# 3. Extract the product gallery information
gallery = soup.find("div", class_="gallery")
product_cards = gallery.find_all("div", class_="image-card")
print("\nProduct gallery:")
for card in product_cards:
    # Image information
    img_tag = card.find("img")
    img_name = img_tag.get("alt")
    img_src = img_tag.get("src")

    # Detail link
    detail_link = card.find("a")
    link_text = detail_link.text
    link_url = detail_link.get("href")

    print(f"- {img_name}:")
    print(f"  Image URL: {img_src}")
    print(f"  Detail link: {link_url} ({link_text})")

# 4. Extract the contact information
contact_div = soup.find("div", id="contact")
contact_title = contact_div.find("h3").text
email_link = contact_div.find("a", href=True)
email = email_link.get("href").replace("mailto:", "")
print(f"\n{contact_title}: {email}")

# 5. Inspect the document structure
print("\nTop-level elements of the page:")
body_children = soup.body.find_all(recursive=False)  # direct children only
for child in body_children:
    if child.name:  # find_all() returns only tags, so this is just a safeguard
        print(f"- {child.name} (class: {child.get('class')}, id: {child.get('id')})")

Output:

Page title: 现代化产品展示页
Main heading: 🌟 网站导航与图片展示 🌟

Navigation links:
Link 1: 百度搜索 → https://www.baidu.com (target: _blank)
Link 2: 京东购物 → https://www.jd.com (target: _blank)
Link 3: 联系我们 → #contact (target: _blank)

Product gallery:
- 产品A展示图:
  Image URL: 产品A.jpg
  Detail link: https://item.jd.com/100209268193.html (查看产品A详情)
- 产品B展示图:
  Image URL: 产品B.jpg
  Detail link: https://item.jd.com/100089816136.html (查看产品B详情)
- 产品C展示图:
  Image URL: 产品C.jpg
  Detail link: https://item.jd.com/100142621568.html (查看产品C详情)

📧 联系我们: contact@company.com

Top-level elements of the page:
- div (class: ['container'], id: None)
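
The walkthrough above calls .text directly on the results of find(), which works because the page is known to contain every element. On pages whose structure may vary, a missing element makes find() return None and the next attribute access fails. One possible way to guard against that is sketched below; `text_of` is a hypothetical helper written for this example, and the markup is made up.

```python
from bs4 import BeautifulSoup


def text_of(parent, name, default="", **filters):
    """Return the stripped text of the first match under parent, or default."""
    if parent is None:
        return default
    tag = parent.find(name, **filters)
    return tag.get_text(strip=True) if tag is not None else default


html = '<div class="gallery"><div class="image-card"><a href="/a"> Details </a></div></div>'
soup = BeautifulSoup(html, "html.parser")

gallery = soup.find("div", class_="gallery")
sidebar = soup.find("div", class_="sidebar")  # not on the page, so None

link_text = text_of(gallery, "a")
missing_heading = text_of(gallery, "h3", default="(no heading)")
sidebar_text = text_of(sidebar, "a", default="(no sidebar)")

print(link_text, missing_heading, sidebar_text)
```

Whether a default value or an explicit error is the better response to a missing element depends on the task; a silent default can hide the fact that a page layout has changed.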

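7. Common Problems and Solutions

7.1 Garbled Chinese Text

When extracted text comes out garbled, the cause is usually a mismatch between the encoding of the page and the encoding used to decode it: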

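# Fix: tell BeautifulSoup the correct encoding.
# from_encoding only applies to raw bytes; it is ignored when the input is already a str.
html_bytes = response.content  # assumes `response` is a requests response, as in section 4.1
soup = BeautifulSoup(html_bytes, "lxml", from_encoding="utf-8")

# Or specify the encoding when reading a local file
with open("file.html", "r", encoding="utf-8") as f:
    html_content = f.read()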
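
To see the effect of from_encoding in isolation, the sketch below simulates the raw bytes of a GBK-encoded page. The bytes are produced locally for illustration; in a real script they would typically come from something like response.content.

```python
from bs4 import BeautifulSoup

# Simulated raw bytes of a page that was saved as GBK rather than UTF-8.
html_bytes = "<p>你好，世界</p>".encode("gbk")

# Passing bytes lets BeautifulSoup do the decoding, using the encoding we name.
soup = BeautifulSoup(html_bytes, "html.parser", from_encoding="gbk")
decoded_text = soup.p.string
print(decoded_text)
```

When the input is bytes, the `soup.original_encoding` attribute reports which encoding BeautifulSoup ended up using, which can help when diagnosing garbled output.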

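7.2 Dynamic Content Cannot Be Extracted

Problem: BeautifulSoup only sees the HTML it is given, so content generated by JavaScript in the browser is not in the parsed tree.

Solutions:

Analyze the site's API endpoints and request the data directly (recommended)
Use a tool such as Selenium or Playwright to render the page in a real browser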

7.3 Locating Elements in Deeply Nested Structures

For deeply nested elements, search methods can be chained:

# Multi-level lookup: find the gallery first, then the cards inside it
products = soup.find("div", class_="gallery").find_all("div", class_="image-card")

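7.4 Dealing with Anti-Scraping Measures

Set reasonable request headers (especially User-Agent)
Add a delay between requests (using time.sleep())
Cache the data you have already fetched so pages are not requested twice
Use proxy IPs when necessary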
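
The delay and caching points above can be combined in a few lines. The following is a minimal sketch, not a complete crawler: `fetch_politely`, `DEFAULT_HEADERS`, and the User-Agent string are names and values invented for this example, and the actual download function is passed in so the logic can be tried without network access.

```python
import time
from typing import Callable, Dict, Iterable

# Hypothetical header values; adjust them to describe your own client.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; tutorial-crawler/0.1)"}


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 1.0,
) -> Dict[str, str]:
    """Fetch each distinct URL once, pausing between consecutive requests."""
    pages: Dict[str, str] = {}
    for url in urls:
        if url in pages:  # simple in-memory cache: skip repeats
            continue
        if pages:  # no need to wait before the very first request
            time.sleep(delay_seconds)
        pages[url] = fetch(url)
    return pages


# Stand-in for a real download function, used here to show the behavior.
calls = []


def fake_fetch(url: str) -> str:
    calls.append(url)
    return f"<html>{url}</html>"


pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b", "https://example.com/a"],
    fake_fetch,
    delay_seconds=0,
)
print(len(pages), calls)
```

With requests, the `fetch` argument could be something like `lambda url: requests.get(url, headers=DEFAULT_HEADERS, timeout=10).text`; requests.get() also accepts a `proxies` argument if a proxy is needed.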

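8. Chapter Summary

8.1 Key Points

BeautifulSoup's core job is to turn an HTML document into a tree structure you can work with
The most commonly used methods are find() and find_all(), which locate tags
Nodes can be searched by tag name, attributes, text content, and more
The complete data collection workflow: fetch HTML → parse → extract → process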
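
Searching by text content is mentioned above but not demonstrated in this chapter, so here is a minimal sketch on made-up markup. It uses the `string` argument of find() and find_all(), which accepts an exact string or a compiled regular expression (older BeautifulSoup releases called this argument `text`).

```python
import re

from bs4 import BeautifulSoup

html = """
<p>Price: 399.00</p>
<p>Status: sold out</p>
<a href="#contact">Contact us</a>
"""
soup = BeautifulSoup(html, "html.parser")

# With only string=..., the results are the matching strings themselves.
price_strings = soup.find_all(string=re.compile(r"\d+\.\d{2}"))

# Combined with a tag name, the result is the tag whose text matches.
contact_link = soup.find("a", string="Contact us")

# From a matched string, .parent leads back to the enclosing tag.
status_tag = soup.find(string=re.compile("sold out")).parent

print([str(s) for s in price_strings], contact_link["href"], status_tag.name)
```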

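8.2 Practical Tips

Choose a parsing strategy that fits the structure of the HTML
Prefer locating elements by id and class attributes (they are usually more stable)
When content appears to be missing, check whether it is in the static HTML or generated by JavaScript
Respect the site's robots.txt and anti-scraping measures, and collect data legally and compliantly

8.3 Where to Go Next

Use CSS selectors for more flexible lookups (the soup.select() method)
Learn XPath, which complements BeautifulSoup
Learn how to store data (CSV, JSON, databases, and so on)
Learn asynchronous requests and distributed crawling to improve collection throughput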
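
As a preview of the CSS selector direction, the sketch below repeats the kind of nested lookup from section 7.3 with select() and select_one(). The markup is made up to resemble the gallery page, and the example assumes a BeautifulSoup version in which CSS selector support is provided by its soupsieve dependency.

```python
from bs4 import BeautifulSoup

html = """
<div class="gallery">
  <div class="image-card"><img alt="Product A" src="a.jpg"><a href="/a">Details A</a></div>
  <div class="image-card"><img alt="Product B" src="b.jpg"><a href="/b">Details B</a></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# One selector expresses the whole path: gallery > card > link with an href.
links = soup.select("div.gallery > div.image-card a[href]")
hrefs = [a["href"] for a in links]

# select_one() returns the first match, or None if there is none.
first_img = soup.select_one("div.image-card img")
no_match = soup.select_one("div.sidebar")

print(hrefs, first_img["alt"], no_match)
```

Unlike a chain of find() calls, a selector that matches nothing simply yields an empty list (or None from select_one()), so there is no intermediate None to trip over.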
