
HtmlAgilityPack Operations Explained

news 2025/8/26 4:06:35

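1. Installing HtmlAgilityPack

2. Sample HTML

3. Parsing and manipulating HTML with HtmlAgilityPack

4. Code walkthrough

1. Loading an HTML document

2. Selecting elements

3. Extracting attributes

4. Modifying attributes

5. Common XPath patterns for selecting elements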

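HtmlAgilityPack:

Lightweight and efficient; well suited to routine HTML parsing.
Because of its lightweight design, it tends to be fast when you only need to extract or modify element content.
It also handles deeply nested or large HTML documents reasonably smoothly.
The package is small and narrowly focused: it parses HTML and queries it with XPath.
It has no built-in support for CSS selectors; these require an extension library such as Fizzler.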

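1. Installing HtmlAgilityPack

Install HtmlAgilityPack through the NuGet package manager, for example by running `Install-Package HtmlAgilityPack` in the Package Manager Console or `dotnet add package HtmlAgilityPack` from the command line.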

2. Sample HTML

Suppose we have the following HTML content to parse and manipulate:

<!DOCTYPE html>
<html>
<head>
  <title>HtmlAgilityPack Example</title>
  <style>.highlight { color: yellow; } #main { background-color: #f0f0f0; }</style>
</head>
<body>
  <h1 id='main-heading' class='highlight'>Welcome to HtmlAgilityPack</h1>
  <p>This is a <span class='highlight'>simple</span> example.</p>
  <a href='https://example.com' target='_blank'>Visit Example.com</a>
  <ul id='items'>
    <li class='item'>Item 1</li>
    <li class='item'>Item 2</li>
    <li class='item'>Item 3</li>
  </ul>
  <input type='text' id='username' value='JohnDoe' />
  <input type='password' id='password' />
</body>
</html>

3. Parsing and manipulating HTML with HtmlAgilityPack

The following C# example walks through the most common HtmlAgilityPack operations:

using HtmlAgilityPack;
using System;
using System.Linq;

class Program
{
    static void Main(string[] args)
    {
        // Sample HTML content
        string html = @"<!DOCTYPE html>
<html>
<head>
  <title>HtmlAgilityPack Example</title>
  <style>.highlight { color: yellow; } #main { background-color: #f0f0f0; }</style>
</head>
<body>
  <h1 id='main-heading' class='highlight'>Welcome to HtmlAgilityPack</h1>
  <p>This is a <span class='highlight'>simple</span> example.</p>
  <a href='https://example.com' target='_blank'>Visit Example.com</a>
  <ul id='items'>
    <li class='item'>Item 1</li>
    <li class='item'>Item 2</li>
    <li class='item'>Item 3</li>
  </ul>
  <input type='text' id='username' value='JohnDoe' />
  <input type='password' id='password' />
</body>
</html>";

        // 1. Load the HTML document
        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(html);

        // 2. Select elements
        // Use XPath to select every element whose class attribute is exactly 'highlight'
        var highlights = document.DocumentNode.SelectNodes("//*[@class='highlight']");
        Console.WriteLine("Elements with class 'highlight':");
        foreach (var elem in highlights)
        {
            Console.WriteLine($"- <{elem.Name}>: {elem.InnerText}");
        }

        // Select a specific element by its ID
        var mainHeading = document.GetElementbyId("main-heading");
        if (mainHeading != null)
        {
            Console.WriteLine($"\nElement with ID 'main-heading': {mainHeading.InnerText}");
        }

        // Select all <a> tags
        var links = document.DocumentNode.SelectNodes("//a");
        Console.WriteLine("\nAll <a> elements:");
        foreach (var link in links)
        {
            Console.WriteLine($"- Text: {link.InnerText}, Href: {link.GetAttributeValue("href", "")}, Target: {link.GetAttributeValue("target", "")}");
        }

        // Select all <li> elements with class 'item'
        var items = document.DocumentNode.SelectNodes("//li[@class='item']");
        Console.WriteLine("\nList items with class 'item':");
        foreach (var item in items)
        {
            Console.WriteLine($"- {item.InnerText}");
        }

        // Select input elements of a specific type
        var textInput = document.DocumentNode.SelectSingleNode("//input[@type='text']");
        var passwordInput = document.DocumentNode.SelectSingleNode("//input[@type='password']");
        Console.WriteLine($"\nText Input Value: {textInput.GetAttributeValue("value", "")}");
        Console.WriteLine($"Password Input Value: {passwordInput.GetAttributeValue("value", "")}");

        // 3. Extract and modify attributes
        // Read the href attribute of the first link
        string firstLinkHref = links.First().GetAttributeValue("href", "");
        Console.WriteLine($"\nFirst link href: {firstLinkHref}");

        // Change the href attribute of the first link
        links.First().SetAttributeValue("href", "https://newexample.com");
        Console.WriteLine($"Modified first link href: {links.First().GetAttributeValue("href", "")}");

        // 4. Extract and modify text content
        // Read the text of the first paragraph
        var firstParagraph = document.DocumentNode.SelectSingleNode("//p");
        Console.WriteLine($"\nFirst paragraph text: {firstParagraph.InnerText}");

        // Replace the content of the first paragraph
        firstParagraph.InnerHtml = "This is an <strong>updated</strong> example.";
        Console.WriteLine($"Modified first paragraph HTML: {firstParagraph.InnerHtml}");

        // 5. Work with styles
        // Read the element's class attribute
        string h1Classes = mainHeading.GetAttributeValue("class", "");
        Console.WriteLine($"\nMain heading classes: {h1Classes}");

        // Add a new class
        mainHeading.SetAttributeValue("class", h1Classes + " new-class");
        Console.WriteLine($"Main heading classes after adding 'new-class': {mainHeading.GetAttributeValue("class", "")}");

        // Remove a class (done by hand here with string replacement)
        h1Classes = mainHeading.GetAttributeValue("class", "").Replace("highlight", "").Trim();
        mainHeading.SetAttributeValue("class", h1Classes);
        Console.WriteLine($"Main heading classes after removing 'highlight': {mainHeading.GetAttributeValue("class", "")}");

        // 6. Traverse and query the DOM
        // Print the tag name of every direct element child of <body>
        Console.WriteLine("\nChild elements of <body>:");
        var bodyChildren = document.DocumentNode.SelectSingleNode("//body").ChildNodes;
        foreach (var child in bodyChildren)
        {
            if (child.NodeType == HtmlNodeType.Element)
            {
                Console.WriteLine($"- <{child.Name}>");
            }
        }

        // Find elements containing specific text.
        // SelectNodes returns null when nothing matches, which happens here because
        // step 4 replaced the <span> that contained 'simple'.
        var elementsWithText = document.DocumentNode.SelectNodes("//*[contains(text(), 'simple')]");
        Console.WriteLine("\nElements containing the text 'simple':");
        if (elementsWithText != null)
        {
            foreach (var elem in elementsWithText)
            {
                Console.WriteLine($"- <{elem.Name}>: {elem.InnerText}");
            }
        }

        // 7. Generate and print the modified HTML
        string modifiedHtml = document.DocumentNode.OuterHtml;
        Console.WriteLine("\nModified HTML:");
        Console.WriteLine(modifiedHtml);
    }
}

4. Code walkthrough

1. Loading an HTML document

HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);
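
LoadHtml parses a string that is already in memory. When the markup lives in a file, HtmlDocument.Load can read it directly. The following is a minimal sketch under that assumption; the helper name LoadFromFile and the temporary file name are made up for illustration.

```csharp
using System;
using System.IO;
using System.Linq;
using HtmlAgilityPack;

static class LoadDemo
{
    // Hypothetical helper: load an HTML file from disk into an HtmlDocument.
    public static HtmlDocument LoadFromFile(string path)
    {
        var document = new HtmlDocument();
        document.Load(path); // reads and parses the file
        return document;
    }

    static void Main()
    {
        // Write a small page to a temporary file so the sketch is self-contained.
        string path = Path.Combine(Path.GetTempPath(), "hap-load-demo.html");
        File.WriteAllText(path, "<html><body><p id='greeting'>Hello</p></body></html>");

        HtmlDocument document = LoadFromFile(path);

        // ParseErrors lists the problems the parser recovered from.
        Console.WriteLine($"Parse errors: {document.ParseErrors.Count()}");
        Console.WriteLine(document.GetElementbyId("greeting")?.InnerText);
    }
}
```

For pages fetched over HTTP, `new HtmlWeb().Load(url)` returns an HtmlDocument in one step.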

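2. Selecting elements

Select the collection of all elements matching an XPath expression with .SelectNodes("XPath"). It returns null, not an empty collection, when nothing matches.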
```
var elements = document.DocumentNode.SelectNodes("//*[@class='class']");
```

Select a single element with .SelectSingleNode("XPath"), which returns the first match or null:

var div = document.DocumentNode.SelectSingleNode("//div[@id='title-content']");

Select a specific element by its ID with .GetElementbyId("id"):
```
var element = document.GetElementbyId("id");
```
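Get child nodes with .ChildNodes. This is the collection of direct (first-level) children only, not deeper descendants, and it includes text and comment nodes as well as elements: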
```
var bodyChildren = document.DocumentNode.SelectSingleNode("//body").ChildNodes;
```
Get an element's first child node with .FirstChild (HtmlNode itself has no First() method):
```
var firstChildNode = element.FirstChild;
```
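
Two details of the selection calls above are easy to trip over: SelectNodes yields null when there is no match, and the first child of an element is often a whitespace text node, not an element. A small sketch of how both might be handled; SelectOrEmpty and FirstElementChild are hypothetical helper names, not part of HtmlAgilityPack.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class SelectDemo
{
    // Hypothetical helper: like SelectNodes, but yields an empty sequence instead of null.
    public static IEnumerable<HtmlNode> SelectOrEmpty(HtmlNode root, string xpath)
    {
        return root.SelectNodes(xpath) ?? Enumerable.Empty<HtmlNode>();
    }

    // Hypothetical helper: first child that is an element, skipping text and comment nodes.
    public static HtmlNode FirstElementChild(HtmlNode parent)
    {
        return parent.ChildNodes.FirstOrDefault(n => n.NodeType == HtmlNodeType.Element);
    }

    static void Main()
    {
        var document = new HtmlDocument();
        document.LoadHtml("<ul id='items'>\n  <li>Item 1</li>\n  <li>Item 2</li>\n</ul>");

        // Safe to enumerate even though the document contains no <table>.
        foreach (var node in SelectOrEmpty(document.DocumentNode, "//table"))
        {
            Console.WriteLine(node.OuterHtml);
        }

        HtmlNode list = document.GetElementbyId("items");
        // The whitespace before the first <li> is itself a child node.
        Console.WriteLine(list.FirstChild.NodeType);
        Console.WriteLine(FirstElementChild(list).InnerText);
    }
}
```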

3. Extracting attributes

Suppose we want to work with the following element:

var element = document.GetElementbyId("id");

Extract the element's inner HTML
```
string innerHtml = element.InnerHtml;
```
Extract the HTML including the element itself
```
string outerHtml = element.OuterHtml;
```
Extract the text
```
string text = element.InnerText;
```

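Extract an attribute value (the second argument is the default returned when the attribute is missing)

string _value = element.GetAttributeValue("value", "");

Extract href

string href = element.GetAttributeValue("href", "");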
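
The extraction calls above can be combined with default values and entity decoding. The sketch below is illustrative only: the markup, the ids, and the helper names ReadInt and DecodedText are assumptions. It relies on the int overload of GetAttributeValue and on HtmlEntity.DeEntitize, since InnerText keeps entities such as `&amp;` as written.

```csharp
using System;
using HtmlAgilityPack;

static class AttributeDemo
{
    // Hypothetical helper: read an attribute as an int, using a fallback when it is absent or invalid.
    public static int ReadInt(HtmlNode node, string name, int fallback)
    {
        return node.GetAttributeValue(name, fallback);
    }

    // Hypothetical helper: element text with HTML entities decoded.
    public static string DecodedText(HtmlNode node)
    {
        return HtmlEntity.DeEntitize(node.InnerText);
    }

    static void Main()
    {
        var document = new HtmlDocument();
        document.LoadHtml("<input id='age' type='text' value='42' /><p id='note'>Tom &amp; Jerry</p>");

        HtmlNode input = document.GetElementbyId("age");

        // The second argument is returned when the attribute does not exist.
        string placeholder = input.GetAttributeValue("placeholder", "(none)");
        int age = ReadInt(input, "value", 0);
        // The indexer returns null for a missing attribute, so it can test for presence.
        bool hasName = input.Attributes["name"] != null;
        Console.WriteLine($"{placeholder}, {age}, {hasName}");

        Console.WriteLine(DecodedText(document.GetElementbyId("note")));
    }
}
```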

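4. Modifying attributes

Modify href

element.SetAttributeValue("href", "https://newexample.com");

Add a class

string oldClasses = element.GetAttributeValue("class", "");
element.SetAttributeValue("class", oldClasses + " new-class");

Modify class

// Remove a class by hand with string replacement. Replace works on substrings,
// so it would also alter a class such as "highlight-strong".
string newClasses = element.GetAttributeValue("class", "").Replace("highlight", "").Trim();
element.SetAttributeValue("class", newClasses);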
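
Because string replacement works on substrings, a token-based approach is safer when class names share a prefix. The following sketch splits the class attribute on whitespace; AddClassToken and RemoveClassToken are hypothetical helpers written for this example. Recent HtmlAgilityPack releases also appear to offer AddClass and RemoveClass on HtmlNode, so it is worth checking whether your version has them before writing your own.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

static class ClassDemo
{
    static string[] Tokens(string classes) =>
        classes.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

    // Hypothetical helper: remove one class token, leaving similarly named classes alone.
    public static string RemoveClassToken(string classes, string toRemove)
    {
        return string.Join(" ", Tokens(classes).Where(c => c != toRemove));
    }

    // Hypothetical helper: append a class token only if it is not already present.
    public static string AddClassToken(string classes, string toAdd)
    {
        var tokens = Tokens(classes);
        return tokens.Contains(toAdd) ? string.Join(" ", tokens) : string.Join(" ", tokens.Append(toAdd));
    }

    static void Main()
    {
        var document = new HtmlDocument();
        document.LoadHtml("<h1 id='main-heading' class='highlight highlight-strong'>Title</h1>");
        HtmlNode heading = document.GetElementbyId("main-heading");

        string classes = heading.GetAttributeValue("class", "");
        classes = AddClassToken(classes, "new-class");
        classes = RemoveClassToken(classes, "highlight");
        heading.SetAttributeValue("class", classes);

        Console.WriteLine(heading.OuterHtml);
    }
}
```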

5. Common XPath patterns for selecting elements

Select by id

var element = document.DocumentNode.SelectSingleNode("//*[@id='id']");

Select by class (matches only when the class attribute is exactly this value)

var elements = document.DocumentNode.SelectNodes("//*[@class='class']");
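
The `@class='class'` test compares the whole attribute value, so it misses elements that carry several classes. A common XPath 1.0 workaround pads the normalized attribute with spaces and looks for the padded token. The sketch below assumes a hypothetical helper named HasClass and sample markup invented for the comparison.

```csharp
using System;
using HtmlAgilityPack;

static class ClassXPathDemo
{
    // Hypothetical helper: XPath matching elements that have `name` as one of their classes.
    public static string HasClass(string name) =>
        $"//*[contains(concat(' ', normalize-space(@class), ' '), ' {name} ')]";

    // Number of nodes matched by an XPath expression (0 when SelectNodes returns null).
    public static int Count(HtmlDocument document, string xpath) =>
        document.DocumentNode.SelectNodes(xpath)?.Count ?? 0;

    static void Main()
    {
        var document = new HtmlDocument();
        document.LoadHtml(
            "<p class='item'>one</p><p class='item selected'>two</p><p class='items'>three</p>");

        // Exact comparison sees only the element whose attribute is exactly 'item'.
        Console.WriteLine(Count(document, "//*[@class='item']"));
        // Token comparison also sees 'item selected' but not 'items'.
        Console.WriteLine(Count(document, HasClass("item")));
    }
}
```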

Select by matching text

var elementsWithText = document.DocumentNode.SelectNodes("//*[contains(text(), 'simple')]");

Select by class combined with matching text

var elements = document.DocumentNode.SelectNodes("//span[@class='title-content-title' and contains(text(), 'text to match')]");
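
One subtlety with `contains(text(), ...)`: in XPath 1.0 a node-set passed to a string function is reduced to its first node, so only the first text node of each element is checked. Using `.` in place of `text()` tests the element's full text, including text inside child elements. A short sketch with made-up markup; Count is a hypothetical helper.

```csharp
using System;
using HtmlAgilityPack;

static class TextXPathDemo
{
    // Number of nodes matched by an XPath expression (0 when SelectNodes returns null).
    public static int Count(HtmlDocument document, string xpath) =>
        document.DocumentNode.SelectNodes(xpath)?.Count ?? 0;

    static void Main()
    {
        var document = new HtmlDocument();
        document.LoadHtml(
            "<p>This is a <span>simple</span> example.</p><p>Plain and simple.</p>");

        // Only the second <p> matches: the first text node of the first <p> is "This is a ".
        Console.WriteLine(Count(document, "//p[contains(text(), 'simple')]"));
        // Both <p> elements match, because '.' covers the text of the nested <span> too.
        Console.WriteLine(Count(document, "//p[contains(., 'simple')]"));
    }
}
```

Which form is appropriate depends on whether text inside nested elements should count as a match.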
